ABSTRACT

The WIT (What Is There) (http://wit.mcs.anl.gov/WIT2/)
system has been designed to support comparative 
analysis of sequenced genomes and to generate 
metabolic reconstructions based on chromosomal 
sequences and metabolic modules from the EMP/MPW 
family of databases. This system contains data 
derived from about 40 completed or nearly completed 
genomes. Sequence homologies, various ORF- 
clustering algorithms, relative gene positions on the 
chromosome and placement of gene products in 
metabolic pathways (metabolic reconstruction) can 
be used for the assignment of gene functions and for 
development of overviews of genomes within WIT. 
The integration of a large number of phylogenetically 
diverse genomes in WIT facilitates the understanding 
of the physiology of different organisms. 

INTRODUCTION 

Starting with Haemophilus influenzae (1) in 1995, over 20
microbial organisms have had their total genomic DNA
sequenced, and almost 100 others have been started, as shown in
the GOLD database (2). Currently we are observing an impressive
development of the human genome project (3,4). In response
to this growing amount of sequence data, computational tools 
for genome analysis have been developed and merged into 
shared analytical environments, such as GeneQuiz (5), KEGG 
(6), Pedant (7) and Entrez Genomes (8), moving cross-genome 
analysis to a new level. The development of analytical
systems, together with the growth of sequencing data, has
increased gene recognition rates from <50% (9,10) to >70%
(11,12). Today, the remaining 30%, the so-called 'hypothetical'
or 'orphan' genes, separates us from a complete description of
the genomic content and functions of an organism.

Computational approaches based on various types of clustering
of potential genes, whether in phylogenetic space, as with
clusters of orthologous genes (COGs) (13), or by position on the
chromosome, as in operons (14), increase the gene assignment
level even further. An important stage of genome analysis is the
integration of gene assignments into an organism-specific
overview via so-called functional reconstruction (15), which is
the conceptual assembly of metabolic pathways, transport
units and signal transduction pathways. It allows reconciliation
of inconsistencies between different types of analysis, and
often results in changes of initial gene function assignments
based on similarity scoring.

The WIT system, discussed in this paper, represents the 
development of a genome analysis strategy in a multi-genome 
environment, which combines a variety of tools, dealing with 
individual open reading frames (ORFs) or proteins, with the 
ability to derive general conclusions. Using the WIT genome 
analysis system, a major part of the central metabolism of an 
organism can be reconstructed entirely in silico (16). 

WIT: A VIEW TO A GENOME 

The current version of the WIT system is available at Argonne
National Laboratory (http://wit.mcs.anl.gov/WIT2/) or at Integrated
Genomics Inc. (http://wit.IntegratedGenomics.com/IGwit) and
contains 43 complete or nearly complete genomes (Table 1).

These genomes consist of 123 482 predicted ORFs, of which
78 144 could be given functional assignments and 41 742
could be assembled into metabolic pathways drawn from the
EMP/MPW database (15). Pathways involved in the metabolism
of carbohydrates and amino acids are connected into schematic
overviews, allowing the user to identify the substrates and final
products that connect metabolic modules.

In order to incorporate a genome into WIT, a gene-searching 
program called CRITICA (17) can be used. Potential coding 
regions recognized in the DNA contigs are subjected to a 
FASTA search against the non-redundant database of assigned 
genes and loaded into the WIT system, together with the pre- 
computed tables of best hits. 

WIT provides a set of tools for the characterization of gene 
structures and functions, such as Functional Coupling, or 
Preserved Operons. WIT also provides integrated WWW 
access to such tools as PSI-BLAST, PROSITE, ProDom, 
COG, ClustalW and others. Functional content may be 
queried, for example, by looking for specific functions missing 
in the metabolic pathways, or by separating alternative gene 
functions derived from similarities found for a putative gene. 

Table 1. Genomes in WIT

Eukarya
    Saccharomyces cerevisiae, Caenorhabditis elegans

Archaea
    Sulfolobus solfataricus, Archaeoglobus fulgidus, Halobacterium sp.,
    M.thermoautotrophicum, M.jannaschii, Pyrococcus furiosus,
    Pyrococcus horikoshii

Bacteria
    A.aeolicus, C.trachomatis, Synechocystis sp., P.gingivalis,
    M.leprae, M.tuberculosis, B.subtilis, C.acetobutylicum,
    E.faecalis, M.genitalium, M.pneumoniae, S.pneumoniae, S.pyogenes,
    Rhizobium sp., R.capsulatus, S.aromaticivorans, N.gonorrhoeae,
    N.meningitidis, C.jejuni, H.pylori, E.coli, Y.pestis, H.influenzae,
    P.aeruginosa, B.burgdorferi, T.pallidum, D.radiodurans

Additional genomes on the public server at Integrated Genomics Inc.
    A.pernix, M.bovis, C.tepidum, S.typhi, T.maritima,
    A.actinomycetemcomitans, E.nidulans, Oryza sativa, A.thaliana,
    R.prowazekii, P.abyssi, C.pneumoniae, C.reinhardtii

After genes have been assigned initial functions, they are
then 'attached' to pathways by choosing templates from the
metabolic database (MPW) that best incorporate all observed
functions. For any given organism, this usually leads to
identification of functional sub-systems, as a model for further
refinement. For example, it is now possible to identify
inconsistencies and potentially missing enzymes/ORFs, thereby
refining the model. When a basic model has been created, a
curator finally evaluates this model against biochemical data
and phenotypes known from the literature. The models come in
both textual and graphical representations, fully linked with
all underlying data. We call this whole process metabolic
reconstruction, and the main role of the WIT system is to
support this effort.
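
The template-matching idea can be illustrated with a small sketch. It is not WIT's actual procedure: it assumes, purely for illustration, that each pathway template is a set of EC numbers and that templates are ranked by the fraction of their steps that already have an asserted function. The function name, template names and EC sets are hypothetical.

```python
def score_templates(asserted, templates):
    """Rank pathway templates by the fraction of their steps that are asserted.

    asserted  -- set of EC numbers already assigned to ORFs (hypothetical input)
    templates -- dict mapping a template name to its set of EC numbers
    """
    report = []
    for name, steps in templates.items():
        present = steps & asserted
        missing = sorted(steps - asserted)  # candidate "missing enzymes"
        report.append((len(present) / len(steps), name, missing))
    # Highest coverage first; names are unique, so ties are broken by name.
    return sorted(report, reverse=True)


# Toy data, not taken from EMP/MPW.
asserted = {"2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13"}
templates = {
    "toy_pathway_A": {"2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13", "5.3.1.1"},
    "toy_pathway_B": {"1.1.1.49", "3.1.1.31"},
}
ranking = score_templates(asserted, templates)
for coverage, name, missing in ranking:
    print(f"{name}: coverage={coverage:.2f}, missing={missing}")
```

In this toy case the best-covered template still lacks one step, which is the kind of gap a curator would then investigate.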

To examine or curate a functional model of an organism, one 
can use functions such as: Compare assignments, Summary of 
asserted functions and pathways, Examine trimmed ortholog 
clusters, Examine COG/trimmed ortholog cluster relationships, 
Search for pathways by regular expression, Search ORF functions 
by regular expression, Search ORF sequences by similarity 
search, Find NCBI's MEDLINE references by EC-number,
Search EMP by EC-number, and Find common proteins for 
organisms. Chromosomal clustering of functionally related 
genes (14) is another powerful component of the system, 
which recently allowed us to propose a number of candidate 
ORFs for 'orphan' metabolic functions. Continuous integration of 
newly sequenced genomes increases the depth of functional 
description by a reiterative process. 

GAPPED GENOMES IN WIT 

An important feature of the WIT system is its emphasis on 
incomplete or gapped genomes. Algorithms used for gene 
assignments depend on the size of a dataset used to cluster 
properties of ORFs, whether it is chromosomal position or 
ortholog clustering based on bi-directional best hits. By the
incorporation of gapped genomes, even the public version of
WIT has integrated twice as much data as can be collected
from only the completed genomes.
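
As a rough illustration of ortholog pairing by bi-directional best hits, the sketch below assumes that pairwise similarity scores between the ORFs of two genomes are already available as dictionaries. The data layout, function names and scores are assumptions made for this example, not WIT's implementation.

```python
def best_hits(scores):
    """Map each query ORF to its highest-scoring subject ORF.

    scores -- dict mapping (query, subject) to a similarity score
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Toy scores for three ORFs in genome A and two in genome B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 90,
          ("a2", "b2"): 150, ("a3", "b2"): 160}
b_vs_a = {("b1", "a1"): 195, ("b2", "a3"): 158, ("b2", "a2"): 149}
pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here a2 also hits b2, but b2's best hit is a3, so only the reciprocal pairs are kept. A gapped genome can still contribute such pairs as long as a fragment of the ORF is present.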

We believe that integrating systems like WIT can offer a 
solution for the problem of efficient use of incomplete 
sequence data. The gapped sequence contains a piece of almost 
every ORF, which allows the assignment of functions to almost 
all ORFs and the accurate reconstruction of the metabolism of the 
organism; good informatics can compensate for poorer 
sequence quality. A comparison of the results of analysis of the
gapped genome of Pseudomonas aeruginosa with the
complete genomes of Escherichia coli and Bacillus subtilis
supports this statement (Table 2).

CONCLUSIONS 

WIT has been designed to extract functional content from 
genome sequences and organize it into a coherent system, in 
order to facilitate post-sequencing experimental biology. The 
WIT system provides a set of local tools, which can be used to 
investigate functions of individual ORFs, based on similarities, 
motifs and various types of ORF clustering. It also generates 
overviews of functional subsystems and means to connect 
them into a complete picture of cellular functionality. 

The WIT system is undergoing constant improvements, which 
can be traced in the PUMA-WIT-WIT2 line of development, 
and we believe that numerous further additions are needed to 
provide an adequate toolbox for the biological research 
community. Major directions of the ongoing WIT development 
are the following: (i) integration of structural data, which are 
currently underutilized in WIT; (ii) further development of the 
collection of functional maps and construction of more abstract 
scalable overviews, which should eventually cover all cellular 
functionality; and (iii) development of a framework, which
will integrate a flood of the differential display expression
array data into the metabolic context.

Table 2. Comparison of the gapped P.aeruginosa genome with those of E.coli K-12 and B.subtilis 168

                      Pseudomonas   Escherichia   Bacillus
                      aeruginosa    coli          subtilis
Genome size (Mb)      6.2           4.7           4.1
DNA assembled (%)     99            100           100
Total ORFs            5627          4289          4083
Assigned ORFs         4191          3499          3016
Asserted pathways     581           906           782
Missing assignments   133           102           178
No sequences          115           233           173
ABSTRACT We have developed three computer programs
for comparisons of protein and DNA sequences. They
can be used to search sequence data bases, evaluate similarity
scores, and identify periodic structures based on local
sequence similarity. The FASTA program is a more sensitive
derivative of the FASTP program, which can be used to search
protein or DNA sequence data bases and can compare a
protein sequence to a DNA sequence data base by translating
the DNA data base as it is searched. FASTA includes an
additional step in the calculation of the initial pairwise
similarity score that allows multiple regions of similarity to be
joined to increase the score of related sequences. The RDF2
program can be used to evaluate the significance of similarity
scores using a shuffling method that preserves local sequence
composition. The LFASTA program can display all the
regions of local similarity between two sequences with scores
greater than a threshold, using the same scoring parameters
and a similar alignment algorithm; these local similarities can
be displayed as a "graphic matrix" plot or as individual
alignments. In addition, these programs have been generalized
to allow comparison of DNA or protein sequences based on a
variety of alternative scoring matrices.



We have been developing tools for the analysis of protein 
and DNA sequence similarity that achieve a balance of 
sensitivity and selectivity on the one hand and speed and 
memory requirements on the other. Three years ago, we 
described the FASTP program for searching amino acid 
sequence data bases (1), which uses a rapid technique for 
finding identities shared between two sequences and exploits 
the biological constraints on molecular evolution. FASTP 
has decreased the time required to search the National 
Biomedical Research Foundation (NBRF) protein sequence 
data base by more than two orders of magnitude and has 
been used by many investigators to find biologically significant
similarities to newly sequenced proteins. There is a
trade-off between sensitivity and selectivity in biological
sequence comparison: methods that can detect more distantly
related sequences (increased sensitivity) frequently
increase the similarity scores of unrelated sequences
(decreased selectivity). In this paper we describe a new version
of FASTP, FASTA, which uses an improved algorithm that 
increases sensitivity with a small loss of selectivity and a 
negligible decrease in speed. We have also developed a 
related program, LFASTA, for local similarity analyses of 
DNA or amino acid sequences. These programs run on 
commonly available microcomputers as well as on larger 
machines. 

METHODS 

The search algorithm we have developed proceeds through
four steps in determining a score for pairwise similarity.




FASTP and FASTA achieve much of their speed and selectivity
in the first step, by using a lookup table to locate all
identities or groups of identities between two DNA or amino
acid sequences (2). The ktup parameter determines how many
consecutive identities are required in a match. For example,
if ktup = 4 for a DNA sequence comparison, only those
identities that occur in a run of four consecutive matches are
examined. In the first step, the 10 best diagonal regions are
found using a simple formula based on the number of ktup
matches and the distance between the matches without
considering shorter runs of identities, conservative
replacements, insertions, or deletions (1, 3).
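
The lookup-table idea of the first step can be sketched as follows. This is a simplified illustration rather than the FASTA source: it ranks diagonals only by the number of ktup matches and ignores the distance term of the formula mentioned above. The function names and test sequences are hypothetical.

```python
from collections import defaultdict


def ktup_diagonal_counts(query, library, ktup):
    """Count shared ktup-length words on each diagonal (query pos - library pos)."""
    # Lookup table: every ktup-length word in the query -> its start positions.
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    # Scan the library sequence once; each shared word votes for one diagonal.
    counts = defaultdict(int)
    for j in range(len(library) - ktup + 1):
        for i in table.get(library[j:j + ktup], ()):
            counts[i - j] += 1
    return counts


def best_diagonals(query, library, ktup, keep=10):
    """Return up to `keep` (diagonal, count) pairs, most matches first."""
    counts = ktup_diagonal_counts(query, library, ktup)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:keep]


top = best_diagonals("ACGTACGTGG", "TTACGTACGT", ktup=4)
print(top)
```

Because the table is built once per query, the cost of scanning each library sequence is roughly proportional to its length plus the number of word matches, which is where much of the speed comes from.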

In the second step of the comparison, we rescore these 10
regions using a scoring matrix that allows conservative
replacements and runs of identities shorter than ktup to
contribute to the similarity score. For protein sequences,
this score is usually calculated using the PAM250 matrix (4),
although scoring matrices based on the minimum number of
base changes required for a replacement or on an alternative
measure of similarity can also be used with FASTA. For
each of these best diagonal regions, a subregion with
maximal score is identified. We will refer to this region as the
"initial region"; the best initial regions from Fig. 1A are
shown in Fig. 1B.
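
Finding the subregion with maximal score on a diagonal is a maximum-scoring-segment problem. The sketch below shows one standard way to solve it with a single pass along the diagonal. The match/mismatch scoring function is a hypothetical stand-in for a real matrix such as PAM250, and the code is not taken from FASTA.

```python
def best_subregion(query, library, diagonal, score):
    """Return (best_score, query_start, query_end) on one diagonal.

    Diagonal d pairs query[i] with library[i - d]; query_end is exclusive.
    """
    start = max(0, diagonal)
    stop = min(len(query), len(library) + diagonal)
    best = (0, start, start)
    running, run_start = 0, start
    for i in range(start, stop):
        if running <= 0:
            # A non-positive prefix can never help; restart the segment here.
            running, run_start = 0, i
        running += score(query[i], library[i - diagonal])
        if running > best[0]:
            best = (running, run_start, i + 1)
    return best


def toy_score(a, b):
    """Hypothetical scores: +5 for an identity, -4 for a mismatch."""
    return 5 if a == b else -4


region = best_subregion("ACGTACGTGG", "TTACGTACGT", -2, toy_score)
print(region)
```

With a real substitution matrix, conservative replacements would score positively, so runs of identities shorter than ktup can still contribute to the region score.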

The FASTP program uses the single best scoring initial
region to characterize pairwise similarity; the initial scores
are used to rank the library sequences. FASTA goes one
step further during a library search; it checks to see whether
several initial regions may be joined together. Given the
locations of the initial regions, their respective scores, and a
"joining" penalty (analogous to a gap penalty), FASTA
calculates an optimal alignment of initial regions as a
combination of compatible regions with maximal score. FASTA
uses the resulting score to rank the library sequences. We
limit the degradation of selectivity by including in the
optimization step only those initial regions whose scores are
above a threshold. This process can be seen by comparing
Fig. 1B with Fig. 1C. Fig. 1B shows the 10 highest scoring
initial regions after rescoring with the PAM250 matrix; the
best initial region reported by FASTP is marked with an
asterisk. Fig. 1C shows an optimal subset of initial regions
that can be joined to form a single alignment.
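
The joining step can be sketched as a small dynamic program over the initial regions. This is an illustration under stated assumptions, not the published implementation: regions are taken to be compatible when one ends before the other begins in both sequences, and a single flat penalty is charged per join. The tuple layout, function name and numbers are hypothetical.

```python
def join_initial_regions(regions, join_penalty, threshold):
    """Best total score of a chain of compatible initial regions.

    Each region is (query_start, query_end, lib_start, lib_end, score),
    with end coordinates exclusive. Regions at or below the threshold
    are ignored, which limits the loss of selectivity.
    """
    kept = sorted(r for r in regions if r[4] > threshold)
    best = []  # best[i] = best chain score ending with region i
    for i, (qs, qe, ls, le, score) in enumerate(kept):
        total = score
        for j in range(i):
            _, pqe, _, ple, _ = kept[j]
            if pqe <= qs and ple <= ls:  # region j ends before region i begins
                total = max(total, best[j] + score - join_penalty)
        best.append(total)
    return max(best, default=0)


regions = [
    (0, 10, 0, 10, 50),
    (15, 30, 18, 33, 40),
    (5, 12, 40, 47, 30),
    (35, 40, 20, 25, 45),
    (50, 55, 60, 65, 20),   # below the threshold, never joined
]
joined = join_initial_regions(regions, join_penalty=20, threshold=27)
single_best = max(r[4] for r in regions)
print(joined, single_best)
```

In this toy case the single best region scores 50, while joining two compatible regions gives a higher score after the penalty, which mirrors the difference between ranking by one region and ranking by joined regions.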

In the fourth step of the comparison, the highest scoring
library sequences are aligned using a modification of the
optimization method described by Needleman and Wunsch
(5) and Smith and Waterman (6). This final comparison
considers all possible alignments of the query and library
sequence that fall within a band centered around the highest
scoring initial region (Fig. 1D). With the FASTP program,
optimization frequently improved the similarity scores of
related sequences by factors of 2 or 3. Because FASTA
calculates an initial similarity score based on an optimization
of initial regions during the library search, the initial score is
much closer to the optimized score for many sequences. In
fact, unlike FASTP, the FASTA method may yield initial
scores that are higher than the corresponding optimized
scores.

Abbreviation: NBRF, National Biomedical Research Foundation.

Fig. 1. Identification of sequence similarities by FASTA. The
four steps used by the FASTA program to calculate the initial and
optimal similarity scores between two sequences are shown. (A)
Identify regions of identity. (B) Scan the regions using a scoring
matrix and save the best initial regions. Initial regions with scores
less than the joining threshold (27) are [...] the highest scoring
region reported by FASTP. (C) Optimally join initial regions with
scores greater than a threshold. The [...] denote regions that are
joined to make up the [...]. (D) Recalculate an optimized alignment
centered around the highest scoring initial region. The dotted lines
denote the bounds of the optimized alignment. The result of this
alignment is reported as the optimized score.

Local Similarity Analyses. Molecular biologists are often
interested in the detection of similar subsequences within
longer sequences. In contrast to FASTP and FASTA, which
report only the one highest scoring alignment between two
sequences, local sequence comparison tools can identify
multiple alignments between smaller portions of two
sequences. Local similarity searches can clearly show the
results of gene duplications (see Fig. 2) or repeated
structural features (see Fig. 3) and are frequently displayed using
a "graphic matrix" plot (7), which allows one to detect
regions of local similarity by eye. Optimal algorithms for
sensitive local sequence comparison (6, 8, 9) can have
tremendous computational requirements in time and memory,
which make them impractical on microcomputers and,
when comparing longer sequences, on larger machines as
well.
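
To make the cost argument concrete, the sketch below computes a Smith-Waterman style local alignment score with a linear gap penalty. The full computation fills one cell per pair of residues. The optional band restricts the work to cells near a chosen diagonal, in the spirit of the banded optimization described for the fourth step. The linear gap model, the scoring function and the parameter names are assumptions for this example.

```python
def local_alignment_score(a, b, score, gap, band=None, diagonal=0):
    """Best local alignment score of a and b with a linear gap penalty.

    If band is given, only cells with |(i - j) - diagonal| <= band are filled.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if band is not None and abs((i - j) - diagonal) > band:
                continue  # outside the band: the cell stays at 0
            curr[j] = max(
                0,                                        # start a new alignment
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] - gap,                            # gap in b
                curr[j - 1] - gap,                        # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def toy_score(x, y):
    """Hypothetical scores: +5 for an identity, -4 for a mismatch."""
    return 5 if x == y else -4


full = local_alignment_score("GGACGTAA", "TTACGTCC", toy_score, gap=6)
banded = local_alignment_score("GGACGTAA", "TTACGTCC", toy_score, gap=6,
                               band=0, diagonal=0)
print(full, banded)
```

The unbanded version visits len(a) * len(b) cells, while the banded version visits roughly (2 * band + 1) cells per row, which is why restricting the search to a band around a good initial region is so much cheaper.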

The program for detecting local similarities, LFASTA,
uses the same first two steps for finding initial regions that
FASTA uses. However, instead of saving 10 initial regions,
LFASTA saves all diagonal regions with similarity scores
greater than a threshold. LFASTA and FASTA also differ in
the construction of optimized alignments. Instead of
focusing on a single region, LFASTA computes a local alignment
for each initial region. Thus LFASTA considers all of the
initial regions shown in Fig. 1B, instead of just the diagonal
shown in Fig. 1D. Furthermore, LFASTA considers not
only the band around each initial region but also potential
sequence alignments for some distance before and after the
initial region. Starting at the end of the initial region, an
optimization (6) proceeds in the reverse direction until all
possible alignment scores have gone to zero. The location of
the maximal local similarity score in the reverse direction is
then used to start a second optimization that proceeds in the
forward direction. An optimal path starting from the forward
maximum is then displayed (5). The local homologies can be
displayed as sequence alignments (see Fig. 2B) or on a
two-dimensional graphic matrix style plot (see Figs. 2A and 3).

Statistical Significance. The rapid sequence comparison
algorithms we have developed also provide additional tools
for evaluating the statistical significance of an alignment.
There are approximately 5000 protein sequences, with 1.1
million amino acid residues, in the NBRF protein sequence
library, and any computer program that searches the library
by calculating a similarity score for each sequence in the
library will find a highest scoring sequence, regardless of
whether the alignment between the query and library
sequence is biologically meaningful or not. Accompanying the
previous version of FASTP was a program for the evaluation
of statistical significance, RDF, which compares one
sequence with randomly permuted versions of the potentially
related sequence.

We have written a new version of RDF (RDF2) that has
several improvements. (i) RDF2 calculates three scores for
each shuffled sequence: one from the best single initial region
(as found by FASTP), a second from the joined initial regions
(used by FASTA), and a third from the optimized diagonal.
(ii) RDF2 can be used to evaluate amino acid or DNA
sequences and allows the user to specify the scoring matrix to
be employed. Thus sequences found using the PAM250
scoring matrix can be evaluated using the identity or genetic
code matrix. (iii) The user may specify either a global or local
shuffle routine.

Locally biased amino acid or nucleotide composition is
perhaps the most common reason for high similarity scores
of dubious biological significance (10). High scoring
alignments between query and library sequences may be due to
patches of hydrophobic or charged amino acid residues or to
A+T- or G+C-rich regions in DNA. A simple Monte Carlo
shuffle analysis that constructs random sequences by taking
each residue in one sequence and placing it randomly along
the length of the new sequence will break up these patches of
biased composition. As a result, the scores of the shuffled
sequences may be much lower than those of the unshuffled
sequence, and the sequences will appear to be related.
Alternatively, shuffled sequences can be constructed by
permuting small blocks of 10 or 20 residues so that, while the
order of the sequence is destroyed, the local composition is
not. By shuffling the residues within short blocks along the
sequence, patches of G+C- or A+T-rich regions in DNA,
for example, are undisturbed. Evaluating significance with a
local shuffle is more stringent than the global approach, and
there may be some circumstances in which both should be
used in conjunction. Whereas two proteins that share a
common evolutionary ancestor may have clearly significant
similarity scores using either shuffling strategy, proteins
related because of secondary structure or hydropathic
profile may have similarity scores whose significance decreases
dramatically when the results of global and local shuffling
are compared.
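
The two shuffling strategies can be sketched as follows. This is an illustrative reading of the description above, not the RDF2 source: the local shuffle permutes residues within consecutive fixed-size blocks, and significance is summarized as the fraction of shuffled scores that exceed the observed score. The function names are hypothetical.

```python
import random


def global_shuffle(seq, rng):
    """Place every residue at a random position along the whole sequence."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def local_shuffle(seq, window, rng):
    """Shuffle residues only within consecutive blocks of `window` residues."""
    blocks = []
    for start in range(0, len(seq), window):
        block = list(seq[start:start + window])
        rng.shuffle(block)
        blocks.append("".join(block))
    return "".join(blocks)


def fraction_above(observed, shuffled_scores):
    """Fraction of shuffled comparisons scoring higher than the observed one."""
    return sum(s > observed for s in shuffled_scores) / len(shuffled_scores)


rng = random.Random(0)           # fixed seed so the example is repeatable
biased = "A" * 10 + "G" * 10     # two patches of biased composition
print(global_shuffle(biased, rng))
print(local_shuffle(biased, 10, rng))
```

With a window of 10, each homogeneous patch of the toy sequence survives the local shuffle unchanged, whereas the global shuffle mixes the two patches. That is why a score driven by composition alone tends to remain high under local shuffling.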

Implementation. The FASTA/LFASTA package of sequence
analysis tools is written in the C programming language
and has been implemented under the Unix, VAX/VMS,
and IBM PC DOS operating systems. Versions of the
program that run on the IBM PC are limited to query
sequences of 2000 residues; library sequences can be any
length. Copies of the program are available from the authors.

Table 1. FASTA and FASTP initial scores of the T-cell receptor
(RWMSAV) versus the NBRF data base

                                                 Initial score
NBRF code  Sequence                              FASTA  FASTP
RWHUAV     T-cell receptor α chain               155    98
K1HURE     Ig κ chain V-I region                 127    111
KVMS50     Ig κ chain V region                   149    62
KVMSM6     Ig κ chain precursor V regions        141    64
KVRB29     Ig κ chain V region                   126    54
L3HUSH     Ig λ chain V-III region               90     47
KVMS41     Ig κ chain precursor V region         87     87
RWMSBV     T-cell receptor β-chain precursor     94     94
RWHUVY     T-cell receptor β-chain precursor     91     59
RWHUGV     T-cell receptor γ-chain precursor     87     61
RWHUT4     T-cell surface glycoprotein T4        86     63
RWMSVB     T-cell receptor γ-chain precursor     71     41
HVMS44     Ig heavy-chain V region               67     36
G1HUDW     Ig heavy-chain V-II region            62     35

The average FASTP score = 26.1 ± 6.8 (mean ± SD). The
average FASTA score = 26.2 ± 7.2 (mean ± SD). The mean and
SD were computed excluding scores >54. V, Variable.

Although FASTA and LFASTA were designed for protein
and DNA sequence comparison, they use a general method
that can be applied to any alphabet with arbitrary
match/mismatch scoring values. All the scoring parameters,
including match/mismatch values, values for the first residue in a
gap and subsequent residues in the gap, and other parameters
that control the number of sequences to be saved and
the histogram intervals, can be specified without changing
the program.

EXAMPLES 

Comparison of FASTA with FASTP. To demonstrate the
superiority of the FASTA method for computing the initial
score, we compared the protein sequence of a T-cell receptor
α chain (NBRF code RWMSAV) with all sequences in the
NBRF protein data base* and computed initial scores with
both the present and previous methods. The T-cell receptor is
a member of the immunoglobulin superfamily; in Release 12.0
of the data base, this superfamily has 203 members. FASTP
placed 160 immunoglobulin superfamily sequences in the 200
top-scoring sequences; 57 related sequences received initial
scores less than four standard deviations above the mean
score. FASTA placed 180 superfamily members in the 200
top-scoring sequences; only 20 related sequences scored
below four standard deviations above the mean. Table 1
contains specific examples from this data base search. Although
there is often little difference in the two methods, this
example shows that in a number of cases the new method
obtains significantly higher scores between related sequences.
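
The "four standard deviations above the mean" criterion can be made concrete with a short calculation. The sketch below assumes that the means and standard deviations given in the footnote of Table 1 are the ones used for this criterion, which the text does not state explicitly. The helper name is hypothetical.

```python
def sd_above_mean(score, mean, sd):
    """Express a similarity score as a number of SDs above the library mean."""
    return (score - mean) / sd


# Means and SDs as given in the footnote of Table 1.
fasta_mean, fasta_sd = 26.2, 7.2
fastp_mean, fastp_sd = 26.1, 6.8

# HVMS44 (Ig heavy-chain V region) from Table 1: FASTA 67, FASTP 36.
hvms44_fasta = sd_above_mean(67, fasta_mean, fasta_sd)
hvms44_fastp = sd_above_mean(36, fastp_mean, fastp_sd)
print(f"HVMS44: FASTA {hvms44_fasta:.1f} SD, FASTP {hvms44_fastp:.1f} SD")
```

Under these assumptions the same related sequence falls above the four-SD line with the FASTA initial score and below it with the FASTP initial score.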

Nucleic Acid Data Base Search. FASTA can also be used to
search DNA sequence data bases, either by comparing a
DNA query sequence to the DNA library or by comparing an
amino acid query sequence to the DNA library by translating
each library DNA sequence in all six possible reading
frames. We compared the 660-nucleotide rat transforming
growth factor type α mRNA (GenBank locus RATTGFA)
with all the mammalian sequences in Release 48 of
GenBank§. We set ktup = 4 (see Methods), and the search was
completed in under 15 min on an IBM PC AT microcomputer.
The 10 top-scoring library sequences are shown in
Table 2. Although it can be seen that the 3 top-scoring
sequences are clearly related to RATTGFA, there are other
high-scoring sequences that are probably not related, and the
mouse epidermal growth factor, found in the translated data
base search (Table 3), is not found among the top-scoring
sequences.

*Protein Identification Resource (1987) Protein Sequence Database
(Natl. Biomed. Res. Found., Washington, DC), Release 12.

§EMBL/GenBank Genetic Sequence Database (1987) (IntelliGenetics,
Mountain View, CA), Tape Release 48.

Table 2. DNA data base search of rat transforming growth factor
(RATTGFA) versus mammalian sequences

                                                      Score
GenBank locus  Sequence                          Initial  Optimized
HUMTGFAM       Human TGF mRNA                    1336     1618
HUMTGFA2       Human TGF gene (exon 2)           354      366
HUMTGFA1       Human TGF gene (5' end)           224      381
MUSRGEB3       Mouse 18S-5.8S-28S rRNA gene      140      107
MUSRGE52       Mouse 18S-5.8S-28S rRNA gene      140      107
MUSMHDD        MHC class I H-2D                  122      78
HUMMETIF1      Metallothionein (MT)I F gene      116      92
MUSRGLP        45S rRNA (5' end)                 115      83
HUMPS2         pS2 mRNA                          105      106
MUSC1AI1       α-1 type I procollagen            86       89

The 10 sequences having the highest initial scores are given. TGF,
transforming growth factor; MHC, major histocompatibility complex.

To further examine the similarity detected between RATTGFA
and MUSRGEB3, a mouse rRNA gene cluster, we
used the RDF2 program for Monte Carlo analysis of statistical
significance (the window for local shuffling was set to 10
bases). Of the 50 shuffled comparisons (data not shown), 1
obtained an initial score greater than 140 (the observed initial
score), and 9 shuffled sequences obtained optimized scores
greater than 107 (the observed optimized score). Therefore,
the similarity between RATTGFA and MUSRGEB3 is
unlikely to be significant.

Translated Nucleic Acid Data Base Search. When searching
for sequences that encode proteins, amino acid sequence
comparisons are substantially more sensitive than DNA
sequence comparisons because one can use scoring matrices
like the PAM250 matrix that discriminate between conservative
and nonconservative substitutions. A variant of FASTA,
TFASTA, can be used to compare a protein sequence to a
DNA sequence library; it translates the DNA sequences into
each of six possible reading frames "on-the-fly." TFASTA
translates the DNA sequences from beginning to end; it
includes both intron and exon sequences in the translated
protein sequence; termination codons are translated into
unknown (X) amino acids. Table 3 shows the results of a
translating search of the mammalian sequences in the
GenBank DNA data base using the RATTGFA protein sequence
as the query and ktup = 1. In the translated search, the mouse
epidermal growth factor now obtains an initial score higher
than any unrelated sequences; however, HUMTGFA1, which
was found in the DNA data base search but only contains 13
translated codons, is no longer among the top scoring
sequences.
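
The six-frame translation described above can be sketched in a few lines. This is an illustration rather than TFASTA itself: it uses the standard genetic code, translates each strand from beginning to end in three offsets, and writes termination codons (and any codon containing an ambiguous base) as X. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from the first base to the end; stops and unknowns become X."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        protein.append("X" if aa == "*" else aa)
    return "".join(protein)


def six_frame_translations(dna):
    """Return three forward-strand frames followed by three reverse-strand frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]


frames = six_frame_translations("ATGGCCTAAGGA")
print(frames)
```

Because translation does not stop at termination codons, a protein query can still match coding segments that lie beyond a stop or within a longer genomic sequence containing introns.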

Local Similarities. Fig. 2 displays the output of a local
similarity analysis (ktup = 4) of CHPHBA1M, a chimpanzee
α1-globin mRNA, and RABHBAPT, a rabbit α-globin gene,
including the complete coding sequence and a flanking
pseudo-θ1 globin gene. LFASTA can either display a graphic
matrix style plot of the local homologies (Fig. 2A) or the
alignments themselves (Fig. 2B). The right-most three
alignments (Fig. 2A) match the corresponding regions of the
mRNA to exon subsequences from the pseudogene. We note
that the FASTA initial score for the comparison of
CHPHBA1M and RABHBAPT would be based on the three globin
gene exons, while the FASTP initial score would be based on
a single conserved exon.

Biochemistry: Pearson and Lipman

Table 3. Translated DNA data base search of rat transforming growth factor (RATTGFA) versus
mammalian sequences

                                                                     Score
GenBank locus  Sequence                               Frame  Initial  Optimized
RATTGFA        Rat TGF type α                         1      olo      olo
HUMTGFAM       Human TGF mRNA                         2      671      //0
HUMTGFA2       Human TGF gene                         1      204      205
MUSEGF         Mouse EGF mRNA                         3      93       129
MUSMHAB3       Mouse MHC class II H2-1A„              1      91       58
MUSIGCD17      Mouse Ig germ-line DJC region          3'     85       48
HUMESTR        Human estrogen receptor                3      83       65
RATINSI        Rat insulin 1 (Ins-1) gene             2      81       63
MUSTHYSl       Mouse thymidylate synthase             2      80       63
HUMPNU3        Human purine nucleoside phosphorylase  r      80       52

The 10 sequences having the highest initial scores are given. TGF, transforming growth factor; EGF,
epidermal growth factor; D, diversity; J, joining; C, constant; MHC, major histocompatibility
complex.

The Smith-Waterman optimization used in the LFASTA 
program allows the detection of more subtle features than 
can be detected by the eye using a graphic matrix plot, 
because the path traced is locally optimal, even though it 
may only have a slightly higher density of identities and 
conservative replacements. Fig. 3 shows a plot from a local 
similarity self-comparison of the myosin heavy chain from 
the nematode Caenorhabditis elegans (MWKW) using the 
PAM250 matrix. The amino-terminal half of the molecule 
forms a large globular head without any periodic structure; 
the solid line down the main diagonal represents the ex- 
pected identity of the sequence with itself. The symmetrical 
parallel lines along the carboxyl-terminal half of the mole- 
cule correspond to the 28-residue repeat responsible for the 
α-helical coiled-coil structure of the rod segment. 

DISCUSSION 

In searching a data base, one is attempting to measure 
relatedness; in aligning two homologous sequences, one is 
trying to choose the most likely set of mutations since their 
divergence from a common ancestral sequence. Thus any 
tool for the analysis of sequence similarities must contain 
within it an implicit model of molecular evolution. An 
algorithm that guarantees the optimality of its alignments 
based on a set of scoring rules must be judged on how well 
these rules fit our current understanding of the process of 
molecular evolution. Algorithms that sacrifice realism to 
achieve greater efficiency, regardless of their mathematical 
rigor, require careful empirical evaluation. 

Even though the tools we have developed use rigorous 
algorithms at each step and incorporate a realistic model of 
evolution, their hierarchical nature makes them heuristic. The 
original FASTP program has had the benefit of extensive use 
and evaluation by a wide variety of scientists. The FASTA 
program exploits refinements of the previous approach that 
result in a significant improvement in sensitivity. The LFA- 
STA local similarity analysis program is also a logical ex- 
tension of the FASTP approach. 

Because of the trade-offs between sensitivity and selectiv- 
ity in data base searches, the results of any search, and 
particularly those that result in alignment scores that are not 
clearly separated from the distribution of all library sequence 
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B 

10 20 30 40 50 60 

CHPHBA GACTCAGAAAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCG 
:::: :::: ::: X:::::::::::::::::: :: :::::::::::: ::::: : : 
RABHBA GACTGAGAAGGAA-CCACCATGGTGCTGTCTCCCGCTGACAAGACCAACATCAAGACTG 

180 190 200 210 220 

70 80 90 100 110 

CHPHBA CCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGG 

RABHBA CCTGGGAAAAGATCGGCAGCCACGGTGGCGAGTATGGCGCCGAGGCCGTGGAGAGG 
230 240 250 260 270 280 

Fig. 2. Local comparison of an α-globin mRNA sequence with an α-globin gene cluster. An ape α1-globin mRNA sequence (GenBank 
sequence CHPHBA1M) was compared with a rabbit α-globin gene sequence (RABHBAPT) containing a second pseudo-θ-globin gene using the 
LFASTA program. (A) A plot of the homologous regions shared by the two sequences. (B) One of the alignments between the mRNA sequence 
and the rabbit α-globin gene (nucleotides 171-855). Three other alignments between the mRNA sequence and the α-globin gene and three 
alignments with the pseudo-θ-globin gene (nucleotides 3200-3770) were calculated but are not shown. There is 84.3% identity in the 115 
nucleotide overlap. The initial region and optimized scores using LFASTA are 284 and 304, respectively. X denotes the ends of the initial region 
found by LFASTA. 

Fig. 3. Repeated structure in the 
myosin heavy chain. LFASTA was used 
to compare the Caenorhabditis elegans 
myosin heavy chain protein sequence 
(NBRF code MWKW) with itself using 
the PAM250 scoring matrix. The solid, 
dashed, and dotted lines denote decreas- 
ing similarity scores. The solid lines had 
initial region scores greater than 80 and 
optimized local scores greater than 150; 
the longer dashed lines had initial region 
and optimized local scores greater than 
65 and 120, respectively, and the shorter 
dashed lines had initial region and opti- 
mized local scores greater than 50 and 
100, respectively. Homologous regions 
with lower scores are plotted with dots. 



scores, must be carefully evaluated (1, 11). The Monte Carlo 
analysis of statistical significance provided by a program 
such as RDF2 can often be critical in evaluating a borderline 
similarity. Previously we suggested ranges of z values [(ob- 
served score - mean of shuffled scores)/standard deviation 
of shuffled scores] corresponding to approximate signifi- 
cance levels. However the z values determined in a Monte 
Carlo analysis become less useful as the distribution of 
shuffled scores diverges from a normal distribution, as is 
found with FASTA. Therefore, we now focus on the highest 
scores of the shuffled sequences. For example, if in 50 
shuffled comparisons, several random scores are as high or 
higher than the observed score, then the observed similarity 
is not a particularly unlikely event. One can have more 
confidence if in 200 shuffled comparisons, no random score 
approaches the observed score. In general, our experience 
has led us to be conservative in evaluating an observed 
similarity in an unlikely biological context. 
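
The shuffling test described above can be sketched in a few lines of Python. This is an illustration only, not the RDF2 program: the function names, the simple match/mismatch/gap scores, and the uniform shuffle of the whole second sequence are all assumptions made for the example, whereas the programs discussed here score with matrices such as PAM250. The sketch reports both the z value defined above and the number of shuffled scores that reach the observed score.

```python
import random
import statistics


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not PAM250.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def shuffle_test(query, target, n_shuffles=200, seed=0):
    """Compare the observed score with scores against shuffled targets.

    Returns (observed, z, highest_shuffled, n_as_high), where z is
    (observed - mean of shuffled scores) / standard deviation of
    shuffled scores, and n_as_high counts shuffled scores that are
    as high or higher than the observed score.
    """
    rng = random.Random(seed)
    observed = local_score(query, target)
    residues = list(target)
    shuffled = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        shuffled.append(local_score(query, "".join(residues)))
    mean = statistics.mean(shuffled)
    sd = statistics.pstdev(shuffled)
    z = (observed - mean) / sd if sd > 0 else float("inf")
    n_as_high = sum(s >= observed for s in shuffled)
    return observed, z, max(shuffled), n_as_high


if __name__ == "__main__":
    seq = "MKVLAAGIVGLLLAQWERTYHDSCNP"  # made-up 26-residue sequence
    print(shuffle_test(seq, seq))
```

Because the distribution of shuffled scores is often not normal, the count of shuffled scores that reach the observed score, and the highest shuffled score, are the more direct quantities to inspect; the z value is kept only for comparison with the earlier practice.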

These programs provide a group of sequence analysis 
tools that use a consistent measure for scoring similarity and 
constructing alignments. FASTA, RDF2, and LFASTA all 
use the same scoring matrices and similar alignment algo- 
rithms, so that potentially related library sequences discov- 
ered after the search of a sequence data base can be 
evaluated further from a variety of perspectives. In addition, 
LFASTA can also show alternative alignments between 
sequences with periodic structures or duplications. 

Beyond complete genomes: from sequence to structure and 
function 

Eugene V Koonin*, Roman L Tatusov and Michael Y Galperin 



Computer analysis of complete prokaryotic genomes shows 
that microbial proteins are in general highly conserved - 
~70% of them contain ancient conserved regions. This allows 
us to delineate families of orthologs across a wide 
phylogenetic range and, in many cases, predict protein 
functions with considerable precision. Sequence database 
searches using newly developed, sensitive algorithms result in 
the unification of such orthologous families into larger 
superfamilies sharing common sequence motifs. For many of 
these superfamilies, prediction of the structural fold and 
specific amino acid residues involved in enzymatic catalysis is 
possible. Taken together, sequence and structure comparisons 
provide a powerful methodology that can successfully 
complement traditional experimental approaches. 

Abbreviations 

COGs clusters of orthologous groups 
HAD haloacid dehalogenase 

Introduction 

The determination of the complete genome sequences 
of several bacteria and archaea and one eukaryote 
[1-6,7-12] marked the beginning of a new age in biol- 
ogy. For the first time, we can take a look at the com- 
plete set of proteins present in the cells of each 
particular organism and try to identify the proteins 
responsible for each cellular function. In cases where no 
known proteins can be found to perform a particular 
task, the most likely substitutes can be predicted from 
the set of unassigned gene products. Clearly this can be 
done only by analysis of complete genomes, as partial 
sequences do not allow us to ascertain that certain pro- 
teins are not encoded in a given genome [13]. These 
new approaches are gradually changing our understand- 
ing of a variety of biological phenomena. As the number 
of sequenced genomes is expected to grow exponential- 
ly for the next few years, their impact on different bio- 
logical disciplines will increase. We have recently 
discussed the implications of the complete genomes for 
microbial evolution [14]. Here we consider the effect of 
the genome revolution, together with the improving 
methods for sequence analysis, on our ability to predict 
and understand protein structure and function. 



Towards a natural taxonomy of proteins and 
protein families 

The numerous genome sequencing projects have resulted 
in a rapid growth of protein databases (see, e.g. [15]). In 
contrast to the pre-genome era, when researchers typically 
chose to clone and sequence genes with documented 
functional roles, we are now getting many protein 
sequences whose functions are not known. This presents 
a challenge to extract the most from these sequences in 
terms of salient features of the encoded proteins, for exam- 
ple to classify them according to their homologous rela- 
tionships, and to predict their possible catalytic activities 
and/or cellular functions, three-dimensional (3D) struc- 
tures and evolutionary origin. 

Protein classifications, pioneered by Dayhoff and her co- 
workers, have historically been based on sequence align- 
ments. Similar proteins formed families, which were 
combined into superfamilies [16]. This approach, contin- 
ued in the PIR database [17], proved extremely popular. 
However, even PIR superfamilies often unite closely 
related proteins and more distant relationships are being 
missed. Other protein databases, such as PROSITE [18], 
PRINTS [19], Pfam [20], and ProDom [21], group pro- 
teins on the basis of conserved sequence motifs and, gen- 
erally, contain much more diverse protein families. 
Structural comparisons of proteins, implemented in FSSP, 
CATH and SCOP databases, offer yet another approach 
to protein classification [22-24]. SCOP superfamilies, for 
example, unite proteins that have some similarities in 
their 3D structures, but often no detectable sequence 
similarity [25]. Thus, in the absence of clear sequence or 
structural similarities, the criteria for inclusion of distant- 
ly related proteins into a family (or superfamily) become 
increasingly arbitrary. 

With the inception of extensive genome sequencing, it has 
become possible to classify genes and proteins on a differ- 
ent principle, namely by delineating families of paralogs, 
that is, related genes within the same genome [26,27]. Such 
analyses have revealed a complex hierarchical organization 
of paralogous families in each of the studied genomes and 
produced at least two generalizations: first, the fraction of 
genes that belong to families of paralogs increases with the 
increase of the total number of genes in a genome: from 
~25% in the minimal genome of Mycoplasma genitalium to 
>50% in the large (for a prokaryote) Escherichia coli genome; 
second, the largest superfamilies of paralogs are mostly the 
same in all genomes [28-33]. 

Knowledge of all the protein sequences from multiple com- 
plete genomes (Table 1) allows us to redefine the entire 

Table 1 

Protein families and 3D structures in complete genomes. 

                                       Proteins encoded in the genome*   COGs found    3D structures
Species                                Total number  Belong to COGs†     (% total)     In PDB  Predicted‡
                                                     (% total)
Escherichia coli                       4289          2003 (47%)          821 (95%)     240     867
Haemophilus influenzae                 1717          979 (57%)           658 (77%)     2       267
Helicobacter pylori                    1566          841 (54%)           617 (72%)     0       169
Synechocystis sp.                      3169          1551 (49%)          796 (93%)     2       431
Borrelia burgdorferi                   850           483 (57%)           363 (42%)     0       105
Bacillus subtilis                      4100          1945 (47%)          732 (85%)     12      578
Mycoplasma genitalium                  467           341 (75%)           290 (34%)     0       75/103
Mycoplasma pneumoniae                  677           378 (56%)           309 (36%)     0       76
Methanococcus jannaschii               1715          830 (48%)           498 (58%)     0       170
Methanobacterium thermoautotrophicum   1869          697 (48%)           484 (56%)     0       199
Archaeoglobus fulgidus                 2407          1131 (47%)          512 (60%)     0       290
Saccharomyces cerevisiae               5932          1736 (29%)          577 (67%)     45      846
Caenorhabditis elegans                 12,178        2172 (18%)          466 (54%)     2       NA

*The numbers are from the latest updates in the GenBank genome division (ftp://ncbi.nlm.nih.gov/genbank/genomes). The C. elegans genome is about 
85% complete; the data are from Wormpept 2 (www.sanger.ac.uk/Projects/C_elegans/wormpep). †Based on the set of 860 COGs, obtained by 
adding H. pylori proteins to the original set of 720 COGs [37**]. ‡The numbers are from the PEDANT database [63*], calculated by comparing the 
protein set encoded in each genome to the PDB using FASTA with cutoff score of 120; the second figure for M. genitalium is from [54*]; the data 
for C. elegans are not available. 



problem of protein classification. Since the fraction of pro- 
teins conserved over large phylogenetic distances (ancient 
conserved domains) appears to be nearly constant at ~70% 
in all prokaryotic genomes [34*], it becomes feasible to 
replace more or less arbitrary clustering of proteins by simi- 
larity with consistent groups in which the evolutionary rela- 
tionships between the members are specifically defined. 
Such a classification of proteins can provide a framework for 
evolutionary studies and for rapid, largely automatic, func- 
tional annotation of newly sequenced genomes. 

Several classifications of homologous proteins encoded in 
complete genomes have been produced, based on all- 
against-all protein sequence comparisons [35,36,37**]. Each 
of these projects is aimed at the identification of orthologs, 
that is direct counterparts in different genomes, connected 
by an uninterrupted line of vertical descent and typically 
retaining their physiological function [26,27]. In particular, 
the system of clusters of orthologous groups (COGs) was 
designed to accommodate the vastly different evolution 
rates observed for different genes [37**]. The COGs con- 
struction procedure identifies the closest homologs in each 
of the sequenced genomes for each protein, even if the sim- 
ilarity is fairly low and not statistically significant by itself. 
The approach to the identification of COGs was built upon 
the transitivity of orthologous relationships, that is the sim- 
ple notion that any group of at least three genes from dis- 
tant genomes, which are more similar to each other than 
they are to any other genes from the same genomes, is most 
likely to belong to an orthologous family. Clearly, this is a 
probabilistic assumption based on a 'weak molecular clock' 
concept, which posits that orthologs are more similar to 
each other than they are to paralogs with different, even if 
related, functions. This assumption, however, seems to 
hold true in cases where we have reasons to accept ortholo- 
gy on functional grounds (for example, aminoacyl-tRNA 
synthetases or ribosomal proteins). Orthology is not neces- 
sarily a one-to-one relationship, as in cases of lineage-spe- 
cific duplications, orthology can only be established 
between families of paralogous genes. Such complex rela- 
tionships require caution in the functional interpretation of 
the phylogenetic classification of proteins. Nevertheless, 
about 60% of the original set of 720 COGs [37**] are simple 
families, with no paralogs or with paralogs from one lineage 
only, suggesting the possibility of straightforward transfer of 
functional information from functionally characterized 
genes from model systems such as E. coli and yeast to those 
from poorly characterized genomes. 
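
The notion of a group of at least three genes from different genomes that are each other's closest homologs can be illustrated with a short Python sketch. It is not the COG construction procedure itself: the gene naming scheme ("genome:gene"), the requirement that best hits be reciprocal, and the rule that triangles sharing two genes are merged are simplifying assumptions introduced for the example, and the toy best-hit table is invented.

```python
from itertools import combinations


def genome_of(gene):
    """Genes are named 'genome:gene' in this sketch (an assumed convention)."""
    return gene.split(":")[0]


def find_triangles(best_hit):
    """Find triples of genes from three genomes that are mutual best hits.

    best_hit maps (gene, other_genome) to the closest homolog of gene
    in other_genome.
    """
    def mutual(a, b):
        return (best_hit.get((a, genome_of(b))) == b
                and best_hit.get((b, genome_of(a))) == a)

    genes = sorted({gene for gene, _ in best_hit})
    triangles = []
    for a, b, c in combinations(genes, 3):
        if len({genome_of(a), genome_of(b), genome_of(c)}) < 3:
            continue
        if mutual(a, b) and mutual(b, c) and mutual(a, c):
            triangles.append(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two genes) into larger groups."""
    clusters = [set(t) for t in triangles]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if len(clusters[i] & clusters[j]) >= 2:
                    clusters[i] |= clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters


# Invented toy data: one family found in four genomes, and one pair of
# reciprocal best hits that does not reach the three-genome minimum.
TOY_BEST_HITS = {
    ("eco:a", "hin"): "hin:a", ("eco:a", "hpy"): "hpy:a", ("eco:a", "bsu"): "bsu:a",
    ("hin:a", "eco"): "eco:a", ("hin:a", "hpy"): "hpy:a", ("hin:a", "bsu"): "bsu:a",
    ("hpy:a", "eco"): "eco:a", ("hpy:a", "hin"): "hin:a",
    ("bsu:a", "eco"): "eco:a", ("bsu:a", "hin"): "hin:a",
    ("eco:b", "hin"): "hin:b", ("hin:b", "eco"): "eco:b",
}

if __name__ == "__main__":
    print(merge_triangles(find_triangles(TOY_BEST_HITS)))
```

In the toy data the pair eco:b/hin:b is left out because only two genomes are represented, which mirrors the requirement for three lineages mentioned above.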

The utility of this system of protein classification was test- 
ed on several newly sequenced bacterial, archaeal and 
eukaryotic genomes. Interestingly, with the only exception 
of the minimal genome of M. genitalium, the fraction of the 
proteins that belong to the COGs (ancient families con- 
served across a wide phylogenetic range) is about the 
same and very close to 50% for all prokaryotic genomes 
(Table 1). This is clearly compatible with the previous esti- 
mate that about 70% of the proteins encoded in each 
genome contain ancient conserved regions. The fraction of 
the proteins included in the COGs is at this time lower, 
which is evidently due to the requirement for three distant 
lineages to be included, and to the limited number of 
species in the first instalment of the COGs. There is little 
doubt that with new genomes added, the number of COGs 
will asymptotically approach the total number of ancient 
conserved regions. By contrast, this fraction is much lower 
for eukaryotic genomes, indicating the prevalence of 
eukaryote-specific families. 

Comparison of the new protein sets with the COGs result- 
ed in a number of functional predictions for previously 
uncharacterized proteins. Even for the Helicobacter pylori 
proteins, most of which show highly significant similarity to 
homologs from E. coli and other bacteria and have been 
described in considerable detail, predictions were made 
in more than 100 cases (http://www.ncbi.nlm.nih.gov/COG); 
function was also predicted for a number of archaeal and 
worm proteins (EV Koonin, RL Tatusov, MY Galperin, 
unpublished data). 

Missing gene families and evolution of 
metabolic pathways 

Comparative analysis of the available complete genomes 
shows that metabolic diversity generally correlates with 
genome size. Parasitic bacteria import a variety of metabo- 
lites, which allows them to shed genes encoding enzymes 
for many or even most of the metabolic pathways [1-3, 
8~ t .V\3hi. In contrast, all cells have to rely on their own 
gene products for performing such essential functions as 
genome expression, replication and repair, and membrane 
biogenesis. These tasks alone require at least 
about 200 genes [13,37**]. 

Given complete genome sequences, classification of pro- 
teins into orthologous groups provides a convenient way to 
systematically survey the protein families present or 
absent in a genome and to identify the metabolic pathways 
that are likely to be operative in the organism analyzed. 
When some of the required enzymes cannot be found in 
the genome, the respective pathways are either not opera- 
tive, or use other, unrelated, proteins to catalyze the miss- 
ing steps (see [39]). An example of such an analysis, which 
included superposition of the phylogenetic patterns 
derived from the COGs [37**] over the scheme of glycoly- 
sis, reveals several interesting trends (Figure 1). Glycolysis 
includes three reactions that in different species are cat- 
alyzed by non-orthologous enzymes, namely phosphofruc- 
tokinases, aldolases and phosphoglycerate mutases. 
Interestingly, the second phosphofructokinase in E. coli, 
encoded by the pfkB gene, has apparently been recruited 
from a ubiquitous family of ribokinase-like sugar kinases. 
The ribokinase COG seems to be an example of a complex 
family in which the exact orthologous connections are not 
always easy to trace. In particular, even though PfkB for- 
mally belongs to the COG, there seems to be no actual 
ortholog of it in other genomes. Thus H. pylori does not 
encode a phosphofructokinase at all, although it has genes 
for other kinases of the ribokinase family and, accordingly, 
is represented in the respective COG (Figure 1). 
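
A phylogenetic pattern of the kind superimposed on the glycolysis scheme can be written as a string of one-letter genome codes. The Python sketch below uses the letters and their order from the Figure 1 legend; the function names and the use of '-' for an absent genome are assumptions of this example, and the membership set in the usage line is hypothetical rather than taken from the COG database.

```python
# One-letter genome codes, in the order given in the Figure 1 legend.
GENOME_CODES = [
    ("e", "Escherichia coli"),
    ("h", "Haemophilus influenzae"),
    ("u", "Helicobacter pylori"),
    ("b", "Bacillus subtilis"),
    ("g", "Mycoplasma genitalium"),
    ("p", "Mycoplasma pneumoniae"),
    ("l", "Borrelia burgdorferi"),
    ("c", "Synechocystis sp."),
    ("m", "Methanococcus jannaschii"),
    ("t", "Methanobacterium thermoautotrophicum"),
    ("f", "Archaeoglobus fulgidus"),
    ("y", "Saccharomyces cerevisiae"),
    ("w", "Caenorhabditis elegans"),
]


def phylogenetic_pattern(present):
    """Return the pattern for a family found in the given set of species.

    Each position holds the genome's letter if the family has a member
    in that genome and '-' otherwise ('-' is an assumed placeholder).
    """
    known = {name for _, name in GENOME_CODES}
    unknown = set(present) - known
    if unknown:
        raise ValueError(f"unknown species: {sorted(unknown)}")
    return "".join(code if name in present else "-"
                   for code, name in GENOME_CODES)


def missing_genomes(pattern):
    """List the species in which the family is absent."""
    return [name for (_, name), ch in zip(GENOME_CODES, pattern) if ch == "-"]


if __name__ == "__main__":
    # Hypothetical family present everywhere except the three archaea.
    archaea = {"Methanococcus jannaschii",
               "Methanobacterium thermoautotrophicum",
               "Archaeoglobus fulgidus"}
    members = {name for _, name in GENOME_CODES} - archaea
    print(phylogenetic_pattern(members))
```

Reading the gaps in such a string is what allows a missing enzyme, or a candidate non-orthologous replacement, to be spotted at a glance.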

A remarkable case of non-orthologous gene displacement 
involves two unrelated forms of phosphoglycerate mutase, 
the 2,3-bisphosphoglycerate (BPG)-dependent and the 
BPG-independent one. While H. influenzae and Borrelia 
burgdorferi encode only the BPG-dependent form, and H. 
pylori, mycoplasmas, and archaea encode only the BPG- 
independent form (see [40]), free-living bacteria such as E. 
coli, Bacillus subtilis and Synechocystis sp. possess genes cod- 
ing for both these forms, with two paralogs of the BPG- 
dependent one (Figure 1). Phosphofructokinase, aldolase 
and fructose bisphosphatase genes are all missing in the 
archaea (Figure 1), in accordance with the experimental 
data [41]. This is consistent with the idea that glycolysis 
originally evolved as a biosynthetic pathway, containing 
only the lower (tri-carbon) part [42]. 

Systematic identification of missing links in functional sys- 
tems in organisms for which complete genome sequences 
are available is probably the most important application of 
protein family classification. Conspicuous gaps in the H. 
pylori metabolism became apparent from the COG analy- 
sis, suggesting major revisions to the general scheme of the 
central metabolic pathways in this bacterium (Table 2). In 
particular, unlike most other bacteria (and all with com- 
pletely sequenced genomes), H. pylori seems to possess 
neither glycolysis nor the pentose phosphate shunt, the 
Entner-Doudoroff pathway being the only major route of 
sugar catabolism. Indeed, sugar fermentation, resulting in 
intracellular acid production, would be an additional bur- 
den on the pH maintenance mechanism in this bacterium, 
which has to survive in an external pH of 2-3. By contrast, 
gluconeogenesis, which converts organic acids into sugars 
required for nucleic acid and peptidoglycan biosynthesis 
and thus removes H+ from the cytoplasm, appears to be 
fully functional in H. pylori. For the purpose of energy pro- 
duction, H. pylori apparently depends on amino acid fer- 
mentation, which causes alkalinization of the cytoplasm 
and thus relieves part of the problem of pH maintenance. 
Amino acids and oligopeptides that serve as substrates for 
this fermentation are produced by gastric proteolysis and 
transported by readily identifiable permeases. 
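
The survey of missing links can be sketched as a lookup of each pathway step against the status of the corresponding gene. In the Python example below the gene statuses for H. pylori follow Table 2, but the function name, the reduction of each pathway to a few representative genes, and the rule that any status other than "present" counts as a gap are assumptions made for illustration.

```python
# Status of selected E. coli gene orthologs in H. pylori, as listed in Table 2.
HP_STATUS = {
    "pfkA": "missing",
    "pykA": "missing",
    "gnd": "missing",
    "rpiA": "missing",
    "purA": "present",
    "purB": "present",
    "guaB": "present",
}

# Representative steps only; these short gene lists are an assumed
# simplification, not complete pathway definitions.
PATHWAY_STEPS = {
    "Embden-Meyerhof pathway": ["pfkA", "pykA"],
    "pentose phosphate pathway": ["gnd", "rpiA"],
    "AMP and GMP synthesis from IMP": ["purA", "purB", "guaB"],
}


def survey_gaps(pathways, status):
    """Map each pathway to the list of steps with no functional gene found.

    A gene absent from the status table is treated as a gap.
    """
    return {
        name: [gene for gene in genes if status.get(gene) != "present"]
        for name, genes in pathways.items()
    }


if __name__ == "__main__":
    for pathway, gaps in survey_gaps(PATHWAY_STEPS, HP_STATUS).items():
        verdict = "no gaps found" if not gaps else "gaps: " + ", ".join(gaps)
        print(f"{pathway}: {verdict}")
```

A reported gap is only a starting point: as noted for ribose 5-phosphate isomerase in Table 2, the step may still be carried out by a different gene, so each gap has to be checked for a possible non-orthologous replacement before the pathway is declared non-functional.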

From genomes and families to superfamilies 
and folds 

Classification systems aimed at the identification of fam- 
ilies of orthologs make no attempt to capture the more 
subtle conserved motifs in proteins, which reflect 
ancient relationships at the level of superfamilies and 
frequently are critically important for understanding pro- 
tein functions and structures [43,44]. Computer methods 
for the detection of such motifs and delineation of super- 
families have lately progressed significantly through pro- 
grams such as BLIMPS/MULTIMAT [45], Probe [46], 
and PSI-BLAST [47**], which combine pairwise 
sequence comparisons with profile analysis. PSI-BLAST, 
in particular, has proved to be a powerful tool for the 
detection of subtle sequence motifs, resulting in the dis- 
covery of a number of unsuspected superfamily relation- 
ships [47**,48*]. Furthermore, one of the perhaps 
under-appreciated benefits of the accumulation of 
genomic sequences is the greatly improved capacity to 
identify even very subtle sequence similarities due to 
the increasingly uniform population of the protein uni- 
verse by these relatively unbiased sequence sets, of 
which the new methods for sequence analysis mentioned 
above can take advantage [49*]. 

Figure 1. Glycolytic enzymes in organisms with completely sequenced genomes. The enzymes are listed under E. coli gene names. The COG numbers are 
as in COG database (www.ncbi.nlm.nih.gov/COG, [37**]) (where available). Shaded arrows indicate reversible reactions, black arrows practically 
irreversible ones. Phosphoenolpyruvate synthase-catalyzed reaction in the direction of phosphoenolpyruvate hydrolysis has been demonstrated in 
vitro. Phylogenetic patterns are: e, Escherichia coli; h, Haemophilus influenzae; u, Helicobacter pylori; b, Bacillus subtilis; g, Mycoplasma 
genitalium; p, Mycoplasma pneumoniae; l, Borrelia burgdorferi; c, Synechocystis sp.; m, Methanococcus jannaschii; t, Methanobacterium 
thermoautotrophicum; f, Archaeoglobus fulgidus; y, Saccharomyces cerevisiae; w, Caenorhabditis elegans. 

In the past year, we have seen the identification or signif- 
icant extension of a number of protein superfamilies; 
some examples, with the distribution among complete 
genomes, are shown in Table 3. Most of these superfami- 
lies are universally found in all genomes, with the counts 
more or less proportional to the total number of genes in 
the genome. Some expansions are, however, remarkable, 
such as, for example, urease-related hydrolases and ATP- 
grasp domains in the archaea, and HAD superfamily hydro- 
lases in E. coli and B. subtilis (Table 3). In certain cases, the 
phylogenetic distribution of a superfamily immediately 
suggests major evolutionary events. Thus the BRCT 
domain is present in a single copy in the DNA ligase of all 
bacteria (with one additional copy found only in 
Synechocystis), is missing in the archaea, and is dramatically 
expanded in its distribution in the eukaryotes (Table 3). 
The most obvious interpretation of this distribution is that 
this domain has entered the eukaryotic world by horizon- 
tal gene transfer from bacteria and has undergone exten- 
sive duplication with divergence in the eukaryotes. The 
expansion of this domain into a number of eukaryotic pro- 
teins involved in cell-cycle control [50*,51] may have 
been critical for the very establishment of these systems. 

Table 2 

Genes and pathways missing in Helicobacter pylori. 

Enzyme activity                      E. coli gene  COG number  Status in H. pylori
Phosphofructokinase                  pfkA          COG0206     Missing
                                     pfkB          COG0525     Present (ribokinase)
Pyruvate kinase                      pykA, pykF    COG0470     Missing
Implications for H. pylori metabolism: Absence of the two key glycolytic enzymes shows that 
the Embden-Meyerhof pathway is not functional in H. pylori. Gluconeogenesis enzymes bypassing 
these reactions, fructose bisphosphatase (HP1385) and phosphoenolpyruvate synthase (HP0121), 
are present in H. pylori, allowing it to produce sugars required for peptidoglycan biosynthesis. 

Enzyme activity                      E. coli gene  COG number  Status in H. pylori
6-phosphogluconate dehydrogenase     gnd           COG0360     Missing
Ribose 5-phosphate isomerase         rpiA          COG0120     Missing
Implications for H. pylori metabolism: The pentose phosphate pathway is also not functional. Even 
though H. pylori has a ribose 5-phosphate isomerase encoded by an ortholog of the E. coli rpiB, 
no gene coding for 6-phosphogluconate dehydrogenase could be identified. The only saccharolytic 
pathway in H. pylori appears to be the Entner-Doudoroff pathway. 

Enzyme activity                      E. coli gene  COG number  Status in H. pylori
Lipoate synthase                     lipA          COG0318     Missing
Lipoate-protein ligase               lplA          COG04U      Missing
                                     lipB          COG0319     Missing
Dihydrolipoamide acyltransferase     aceF          COG0510     Missing
Acetate kinase                       ackA          COG0280     Disrupted by a frameshift
Phosphotransacetylase                pta           COG0278     Disrupted by frameshifts
Implications for H. pylori metabolism: The pyruvate dehydrogenase complex is absent in H. pylori; 
acetate kinase and phosphotransacetylase are not functional. Pyruvate-ferredoxin oxidoreductase 
is the only acetyl-CoA-producing enzyme in H. pylori. 

Enzyme activity                      E. coli gene  COG number  Status in H. pylori
Enzymes of purine biosynthesis       purF          COG0034     Missing
                                     purD          COG0151     Inactivated by mutations
                                     purN          COG0299     Missing
                                     purT          COG0027     Missing
                                     purL_1        COG0046     Missing
                                     purL_2        COG0047     Missing
                                     purM          COG0150     Missing
                                     purK          COG0026     Missing
                                     purE          COG0041     Missing
                                     purC          COG0152     Missing
                                     purH          COG0138     Missing
                                     purA          COG0104     Present
                                     purB          COG0015     Present
                                     guaB          COG0516     Present
                                     guaA_1        COG0518     Present
                                     guaA_2        COG0519     Present
Implications for H. pylori metabolism: De novo purine biosynthesis is absent in H. pylori, and it 
has to obtain purines from the host. HP1185 appears to be the best candidate for the purine 
permease, as it is the only H. pylori protein similar to E. coli PurP. On the other hand, 
H. pylori encodes the enzymes for AMP and GMP synthesis from IMP and their interconversion. 
Therefore, it can survive on any of these purines. 

With the current acceleration in protein structure determi- 
nation [22,24], a superfamily identified by sequence com- 
parison more and more frequently extends to include 
proteins with known 3D structure and/or well-character- 
ized catalytic mechanism (Table 3). Such findings are 
sometimes most illuminating as they immediately result in 
the prediction of the structural fold, the structure of the 
active center, and possibly also the catalytic mechanism for 
a wide variety of diverse proteins comprising the super- 
family. This is illustrated by the recent prediction of the 
structure and the catalytic amino acid residues for P- 
ATPases, which remained elusive in spite of a long history 
of studies, on the basis of the sequence motifs shared with 
haloacid dehalogenases [52*]. 

Assignment of the gene products to structural folds and fam- 
ilies with maximal attainable precision is arguably one of the 
foremost tasks of genome analysis after the sequencing 
phase. The number of structures that have been determined 
experimentally is negligible for almost all genomes, with the 
exception of E. coli (where it is still rather a small fraction) 
(Table 1). A database search with a deliberately conservative 
similarity cut-off already increases the fraction of proteins for 
which a confident structure prediction is possible to 10-25% 
[53*] (Table 1). Secondary structure-based threading allows 
another relatively small but notable increase in the predictive 
power [54*] (Table 1). It appears, however, that at this time, 
the most realistic way to further structure prediction at 
genome scale is to perform a complete analysis of protein 
superfamilies as exemplified in Table 3. 

Perspective 

As far as prokaryotic genomes are concerned, we have 
already entered the post-genomic era. While surprises 
certainly wait ahead, there is little doubt that the major 
protein families are already known or can be deciphered 
from the available sequences. We have recently seen 
major progress in methods and procedures for advanced 
sequence analysis, and a lot of valuable information has 
been extracted from the genomes. We believe, however, 
that a major focused effort in genome comparison is still 
required in order to construct a proper classification of 
protein families and superfamilies and systematically 
apply it to the goals of structural and functional predic- 
tion. Such an effort will have the potential of creating a 
basis for a rationally designed, decisive onslaught on 
structure determination and experimental identification 
of gene functions using computer predictions as a guide. 
Hopefully, this research program turns out to be both 
realistic and efficient. 

proteins or orthologous sets of paralogs from at least three lineages. 
Orthologs typically have the same function, allowing transfer of functional 
information from one member to an entire COG. This automatically makes 
possible a number of functional predictions, especially for poorly 
characterized genomes. The evolving system of COGs comprises a 
framework for functional and evolutionary genome analysis; it is accessible 
through the World Wide Web (http://ncbi.nlm.nih.gov/COG). 

39. Htmmelreich R, Plagens H, Htfbert H, Reiner B, Herrmann R: 
Comparative analysis of the genomes of the bacteria 
Mycoplasma pneumoniae and Mycoplasma genrtetium. Nucleic 
Adds Res 1 997, 25:701-71 2. 

39. Koonin EV, Mushegian AR, Bork P: Norvorthologous gene 
displacement Trends Genet 1996, 12:334-336. 

40. Galperin MY, Bairoch A, Koonin EV: A superfamily of 
metaitoeniymes unifies phosphopentomutase and cofactor* 
independent phosphogJycerate mutase with alkaline 
phosphatases and sulfatases. Protein Sci 1 996. 7:in press. 

4 1 . Danson MJ: Central metabolism of the archaea. In The 

Biochemistry ol Archaea (Archaebacteria). Edited by Kates M t 
Kushner DJ, Matheson AT. Amsterdam: Elsevier: 1993:1-24. 

42. Romano AH, Conway T: Evolution of carbohydrate metabolic 
pathways. Res Microbiol 1996. 147:446-455. 

43. Bork P, Koonin EV: Protein sequence motifs. Curr Opin Struct Bid 
1996, 6:366-376. 

44. Bork P, Gibson TJ: Applying motif and profile searches. Methods 
Enzymd 1996, 266:162-184. 

46. Henikoff S, Henikoff JG: Embedding strategies for effective use of 
Information from multiple sequence alignments. Protein Sci 1 997, 
6:698-705. 

46. Neuwald AF, Liu JS, Lipman DJ, Lawrence CE: Extracting protein 
alignment models from the sequence database. Nudeic Adds Res 
1997,25:1665-1677 

47. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zheng Z, Miller W,
• Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res 1997,
25:3389-3402.

A major revamp of BLAST, which is definitely the most popular current
method for database search. The key innovations are: first, the program now
makes gapped alignments, with appropriately modified statistics, which
results in a significant increase of sensitivity; and second, the associated
program PSI (Position-Specific Iterating)-BLAST makes a position-specific
weight matrix (profile) out of the first pass results and iterates searches with
this profile until no new sequences with similarity scores above a defined
cut-off are detected. This appears to be the most powerful existing method
for detection of subtle similarities between protein sequences and
delineation of protein superfamilies.
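
The iterate-until-no-new-hits logic described in this annotation can be illustrated with a minimal sketch. It is not the PSI-BLAST implementation: the function names, the `max_rounds` safeguard, and the toy neighbor table that stands in for a profile search against a database are all hypothetical and serve only to show the convergence loop.

```python
from typing import Callable, Dict, Set

def iterate_profile_search(
    seed_hits: Set[str],
    search_with_profile: Callable[[Set[str]], Set[str]],
    max_rounds: int = 20,
) -> Set[str]:
    """Repeat a profile search until a round adds no new sequences."""
    included = set(seed_hits)
    for _ in range(max_rounds):
        # Conceptually: rebuild the profile from all included sequences,
        # then search the database again with that profile.
        hits = search_with_profile(included)
        new_hits = hits - included
        if not new_hits:
            break  # converged: nothing new scored above the cut-off
        included |= new_hits
    return included

# Hypothetical stand-in for a database search: each sequence "detects"
# a fixed set of neighbors above the cut-off.
NEIGHBORS: Dict[str, Set[str]] = {"A": {"B"}, "B": {"C"}, "C": set(), "D": {"E"}}

def toy_search(profile_members: Set[str]) -> Set[str]:
    found = set(profile_members)
    for member in profile_members:
        found |= NEIGHBORS.get(member, set())
    return found

if __name__ == "__main__":
    print(sorted(iterate_profile_search({"A"}, toy_search)))
```

In the toy data, sequence C is reachable from A only through B, which mirrors how an iterated profile can pick up relatives that the first pass misses.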

48. Mushegian AR, Bassett DE Jr, Boguski MS, Bork P, Koonin EV:
• Positionally cloned human disease genes: patterns of
evolutionary conservation and functional motifs. Proc Natl Acad
Sci USA 1997, 94:5831-5836.

Sequence analysis of the proteins encoded by 70 positionally cloned
human disease genes showed that most of them have orthologs with the
same domain architecture in the nematode, but domain rearrangements are
prevalent in yeast and bacterial homologs. This is one of the first
demonstrations of the utility of PSI-BLAST for the delineation of large
protein superfamilies. In particular, this method was used for the
identification of a conserved ATPase domain present in the repair protein
MutL (one of the colon cancer gene products in humans), histidine kinases,
molecular chaperones of the HSP90 family and type II DNA
topoisomerases; the 3D structure for the latter was already available,
defining the fold for the whole superfamily.

49. Bork P, Koonin EV: Predicting functions from protein sequences:
• where are the bottlenecks? Nature Genet 1998, 18:313-318.

An attempt to analyze the reasons why it is so common that functionally
and phylogenetically important relationships between sequences are not
detected in original analysis (particularly in the framework of genome
projects) but are readily identified in subsequent, more detailed studies. It
appears that the major bottlenecks include inadequate filtering for noise in
sequence data (for example low-complexity sequences and very common
domains) and insufficient cross-talk between different types of information.

50. Bork P, Hofmann K, Bucher P, Neuwald AF, Altschul SF, Koonin EV: A
superfamily of conserved domains in DNA damage-responsive
cell cycle checkpoint proteins. FASEB J 1997, 11:68-76.

A complete description of the BRCT domain that had been originally found
in BRCA1 protein and several other proteins implicated in cell cycle
checkpoints. In this work, the superfamily has been extended to include a
distinct version of the BRCT domain detected in bacterial DNA ligases, the
large subunits of eukaryotic replication factor C, and poly(ADP-ribose)
polymerases. The expansion of the BRCT domain in eukaryotes may be one
of the key events in the evolution of cell-cycle control.

51. Callebaut I, Mornon JP: From BRCA1 to RAP1: a widespread BRCT
module closely associated with DNA repair. FEBS Lett 1997,
400:25-30.

52. Aravind L, Galperin MY, Koonin EV: The catalytic domain of the P-type
• ATPase has the haloacid dehalogenase fold. Trends Biochem
Sci 1998, 23:127-129.

This paper is an example of the application of sequence profile analysis to
the prediction of the 3D fold and the catalytic residues in a critically
important enzyme, P-ATPase, which has defied crystallization attempts and
remained poorly characterized in spite of intense effort.

53. Frishman D, Mewes HW: PEDANTic genome analysis. Trends Genet
• 1997, 13:415-416.

This paper describes a very convenient World Wide Web site compiling
results of automatic analysis of all available complete genomes. The Pedant
WWW site (http://pedant.mips.biochem.mpg.de/frishman/pedant.html) is
arguably one of the best entry points to comparative genomics but it has
to be kept in mind that it is only the first level, crude analysis that is
presented here.

54. Fischer D, Eisenberg D: Assigning folds to the proteins encoded by
• the genome of Mycoplasma genitalium. Proc Natl Acad Sci USA
1997, 94:11929-11934.

One of the first systematic attempts to predict the 3D structures of proteins
starting from a complete genome. The utility of sequence-structure
threading is demonstrated but it also becomes clear that such methods at
best result in a rather small, incremental improvement over state-of-the-art
sequence comparisons. Although the fraction of the proteins with a
predictable fold is only 22% of the gene products, the authors predict by
extrapolation that it should be possible to assign folds to most soluble
proteins within a decade.

55. Holm L, Sander C: An evolutionary treasure: unification of a broad
• set of amidohydrolases related to urease. Proteins 1997, 28:72-82.

A valuable example of a combination of detailed sequence analysis with
structure-structure comparisons resulting in the characterization of a vast
protein superfamily.

56. Stukey J, Carman GM: Identification of a novel phosphatase
sequence motif. Protein Sci 1997, 6:469-472.

57. Neuwald AF: An unexpected structural relationship between
integral membrane phosphatases and soluble haloperoxidases.
Protein Sci 1997, 6:1764-1767.

58. Galperin MY, Koonin EV: A diverse superfamily of enzymes with
ATP-dependent carboxylate-amine/thiol ligase activity. Protein Sci
1997, 6:2639-2643.

59. Aravind L, Koonin EV: A novel family of predicted
phosphoesterases includes Drosophila prune protein and
bacterial RecJ exonuclease. Trends Biochem Sci 1998, 23:17-19.

60. Bond CS, Clements PR, Ashby SJ, Collyer CA, Harrop SJ, Hopwood
JJ, Guss JM: Structure of a human lysosomal sulfatase. Structure
1997, 5:277-289.
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CAF19551 252 aa linear BCT 17-APR-2005 

3'-Phosphoadenosine 5'-phosphosulfate (PAPS) 3'-phosphatase
[Corynebacterium glutamicum ATCC 13032].
CAF19551

CAF19551.1 GI:41325070
embl accession BX927150.1

* 

Corynebacterium glutamicum ATCC 13032
Corynebacterium glutamicum ATCC 13032

Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
Corynebacterineae; Corynebacteriaceae; Corynebacterium.

1 (residues 1 to 252) 

Kalinowski,J., Bathe,B., Bartels,D., Bischoff,N., Bott,M.,
Burkovski,A., Dusch,N., Eggeling,L., Eikmanns,B.J., Gaigalat,L.,
Goesmann,A., Hartmann,M., Huthmacher,K., Kramer,R., Linke,B.,
McHardy,A.C., Meyer,F., Mockel,B., Pfefferle,W., Puhler,A.,
Rey,D.A., Ruckert,C., Rupp,O., Sahm,H., Wendisch,V.F., Wiegrabe,I.
and Tauch,A.

The complete Corynebacterium glutamicum ATCC 13032 genome sequence 
and its impact on the production of L-aspartate-derived amino acids 
and vitamins 

J. Biotechnol. 104 (1-3), 5-25 (2003) 
12948626 

2 (residues 1 to 252) 
Kalinowski, J. 

Direct Submission 

Submitted (21-JAN-2004) Joern Kalinowski, Institut fuer
Genomforschung, Universitaet Bielefeld; Universitaetsstrasse 25,
33615 Bielefeld, Germany
E-mail:Joern.Kalinowski@Cebitec.Uni-Bielefeld.DE

This sequence was accomplished by collaboration between Degussa AG 
and Bielefeld University. 

join(BX927148.1:1..348071,BX927149.1:51..349887,
BX927150.1:51..348475,
BX927151.1:51..349459,BX927152.1:51..349799,BX927153.1:51..349584,
BX927154.1:51..349575,BX927155.1:51..349136,BX927156.1:51..349115,
BX927157.1:51..140057).
Location/Qualifiers
1..252
     /organism="Corynebacterium glutamicum ATCC 13032"
     /strain="DSM 20300 = ATCC 13032"
     /db_xref="taxon:196627"
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/note="IS fingerprint type: 4-5" 
Protein 1..252
     /product="3'-Phosphoadenosine 5'-phosphosulfate (PAPS)
     3'-phosphatase"
Region 10..>229
     /region_name="CysQ, a
     3'-Phosphoadenosine-5'-phosphosulfate (PAPS)
     3'-phosphatase, is a bacterial member of the inositol
     monophosphatase family"
     /note="CysQ"
     /db_xref="CDD:30136"
CDS 1..252
     /gene="cysQ"
     /locus_tag="cg0967"
     /coded_by="complement(BX927150.1:202863..203621)"
     /transl_table=11
     /db_xref="GOA:Q8NS37"
     /db_xref="InterPro:IPR000760"
     /db_xref="UniProtKB/TrEMBL:Q8NS37"

ORIGIN
  1 mtaqiddsil thrlaqgtge ilkgvrnvgv lrgrnlgdag delaqswiar vleqhrpndg
 61 flseeaadnp drlskdrvwi idpldgtkef atgrqdwavh ialvengvpt haavglpdlg
121 wfhsadara vtgpyskvia ishnrppkva lscaeqlgfe tkalgsagak amhvllgdyd
181 ayihaggqye wdsaapvgvc kaaglhcsrl dgseltynnk dtympdilic rpeladelle
241 mcakfyeeng ty
//
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BAB98238 252 aa linear BCT 03-FEB-2005 

3'-Phosphoadenosine 5'-phosphosulfate (PAPS) 3'-phosphatase
[Corynebacterium glutamicum ATCC 13032].
BAB98238

BAB98238.1 GI:21323611
accession BA000036.3

■ 

Corynebacterium glutamicum ATCC 13032 
Corynebacterium glutamicum ATCC 13032 

Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
Corynebacterineae; Corynebacteriaceae; Corynebacterium.

1 

Nakagawa , S . 

Complete genomic sequence of Corynebacterium glutamicum ATCC 13032 

Unpublished 

2 (residues 1 to 252) 

Nakagawa, S . 

Direct Submission 

Submitted (24-MAY-2002) Satoshi Nakagawa, Kyowa Hakko Kogyo Co.
Ltd., Tokyo Research Laboratories; 3-6-6, Asahi-machi, Machida,
Tokyo, 194-8533, Japan (E-mail:snakagawa@xanagen.com,
Tel:81-44-829-3031, Fax:81-44-813-1651)

This sequence is conducted by collaboration of Kyowa Hakko Kogyo 
Co. Ltd. And Kitasato University. 

Location/Qualifiers
1..252
     /organism="Corynebacterium glutamicum ATCC 13032"
     /strain="ATCC 13032"
     /db_xref="taxon:196627"
Protein 1..252
     /product="3'-Phosphoadenosine 5'-phosphosulfate (PAPS)
     3'-phosphatase"
Region 10..>229
     /region_name="CysQ, a
     3'-Phosphoadenosine-5'-phosphosulfate (PAPS)
     3'-phosphatase, is a bacterial member of the inositol
     monophosphatase family"
     /note="CysQ"
     /db_xref="CDD:30136"
CDS 1..252
     /gene="Cgl0845"
     /coded_by="complement(BA000036.2:899250..900008)"
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1 mtaqiddsil thrlaqgtge ilkgvrnvgv lrgrnlgdag delaqswiar vleqhrpndg 

61 flseeaadnp drlskdrvwi idpldgtkef atgrqdwavh ialvengvpt haavglpdlg 

121 wfhsadara vtgpyskvia ishnrppkva lscaeqlgfe tkalgsagak amhvllgdyd 

181 ayihaggqye wdsaapvgvc kaaglhcsrl dgseltynnk dtympdilic rpeladelle 

241 mcakfyeeng ty 
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YP_225137 252 aa linear BCT 17-JAN-2006 

3'-Phosphoadenosine 5'-phosphosulfate (PAPS) 3'-phosphatase
[Corynebacterium glutamicum ATCC 13032].
YP_225137

YP_225137.1 GI:62389735
REFSEQ: accession NC_006958.1
complete genome.

Corynebacterium glutamicum ATCC 13032 
Corynebacterium glutamicum ATCC 13032 

Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales;
Corynebacterineae; Corynebacteriaceae; Corynebacterium.

1 (residues 1 to 252) 

Kalinowski,J., Bathe,B., Bartels,D., Bischoff,N., Bott,M.,
Burkovski,A., Dusch,N., Eggeling,L., Eikmanns,B.J., Gaigalat,L.,
Goesmann,A., Hartmann,M., Huthmacher,K., Kramer,R., Linke,B.,
McHardy,A.C., Meyer,F., Mockel,B., Pfefferle,W., Puhler,A.,
Rey,D.A., Ruckert,C., Rupp,O., Sahm,H., Wendisch,V.F., Wiegrabe,I.
and Tauch,A.

The complete Corynebacterium glutamicum ATCC 13032 genome sequence 
and its impact on the production of L-aspartate-derived amino acids 
and vitamins 

J. Biotechnol. 104 (1-3), 5-25 (2003) 
12948626 

2 (residues 1 to 252) 
NCBI Genome Project 
Direct Submission 

Submitted (07-APR-2005) National Center for Biotechnology 
Information, NIH, Bethesda, MD 20894, USA 

3 (residues 1 to 252) 
Kalinowski , J. 

Direct Submission 

Submitted (21-JAN-2004) Institut fuer Genomforschung, Universitaet
Bielefeld, Universitaetsstrasse 25, Bielefeld 33615, Germany
PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The reference sequence was derived from CAF19551.
Method: conceptual translation.

Location/Qualifiers 

1. .252 

     /organism="Corynebacterium glutamicum ATCC 13032"
     /strain="DSM 20300; ATCC 13032"
     /db_xref="ATCC:13032"
     /db_xref="taxon:196627"
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/note="IS fingerprint type 4-5" 
Protein 1..252
     /product="3'-Phosphoadenosine 5'-phosphosulfate (PAPS)
     3'-phosphatase"
     /calculated_mol_wt=27151
Region 10..>229
     /region_name="CysQ, a
     3'-Phosphoadenosine-5'-phosphosulfate (PAPS)
     3'-phosphatase, is a bacterial member of the inositol
     monophosphatase family"
     /note="CysQ"
     /db_xref="CDD:30136"
CDS 1..252
     /gene="cysQ"
     /locus_tag="cg0967"
     /coded_by="complement(NC_006958.1:900721..901479)"
     /transl_table=11
     /db_xref="GeneID:3345270"

ORIGIN 

1 mtaqiddsil thrlaqgtge ilkgvrnvgv lrgrnlgdag delaqswiar vleqhrpndg 
61 flseeaadnp drlskdrvwi idpldgtkef atgrqdwavh ialvengvpt haavglpdlg 
121 wfhsadara vtgpyskvia ishnrppkva lscaeqlgfe tkalgsagak amhvllgdyd 
181 ayihaggqye wdsaapvgvc kaaglhcsrl dgseltynnk dtympdilic rpeladelle 
241 mcakfyeeng ty 
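
The three records above give the coding region of this protein on three different nucleotide entries. As a quick consistency check, the following sketch computes the length of each `coded_by` interval and the number of residues it would encode. The coordinates are copied from the records with the extraction spacing removed; the helper names are illustrative, and the check assumes that each interval includes the stop codon, which the records themselves do not state.

```python
# coded_by intervals as given in the three records (extraction spacing removed).
CODED_BY = {
    "CAF19551": ("BX927150.1", 202863, 203621),
    "BAB98238": ("BA000036.2", 899250, 900008),
    "YP_225137": ("NC_006958.1", 900721, 901479),
}

def cds_length_nt(start: int, end: int) -> int:
    """Length of an inclusive coordinate interval in nucleotides."""
    return end - start + 1

def encoded_residues(start: int, end: int, includes_stop: bool = True) -> int:
    """Number of amino acids encoded by the interval.

    Assumption: the interval is a whole number of codons and, by default,
    its last codon is the stop codon.
    """
    length = cds_length_nt(start, end)
    if length % 3 != 0:
        raise ValueError(f"interval length {length} is not a multiple of 3")
    codons = length // 3
    return codons - 1 if includes_stop else codons

if __name__ == "__main__":
    for protein, (accession, start, end) in CODED_BY.items():
        print(protein, accession, cds_length_nt(start, end),
              encoded_residues(start, end))
```

Under that assumption each interval corresponds to the 252 residues stated in the record headers.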
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lalign output for SEQ ID NO:6 vs. CAF 1 955 1 . Page 1 of 1 

lalign output for SEQ ID NO:6 vs. CAF19551 
[ISREC-Server] Date: Mon Jun 26 19:28:01 Europe/Zurich 2006 

./wwwtmp/lalign/.19436.1.seq : 252 aa

ALIGN calculates a global alignment of two sequences
version 2.0u Please cite: Myers and Miller, CABIOS (1989) 4:11-17
SEQ ID NO: 6 252 aa vs.
CAF19551 252 aa
scoring matrix: BLOSUM50, gap penalties: -14/-4
100.0% identity; Global alignment score: 1703

               10        20        30        40        50        60
./wwwt MTAQIDDSILTHRLAQGTGEILKGVRNVGVLRGRNLGDAGDELAQSWIARVLEQHRPNDG
CAF195 MTAQIDDSILTHRLAQGTGEILKGVRNVGVLRGRNLGDAGDELAQSWIARVLEQHRPNDG
               10        20        30        40        50        60

               70        80        90       100       110       120
./wwwt FLSEEAADNPDRLSKDRVWIIDPLDGTKEFATGRQDWAVHIALVENGVPTHAAVGLPDLG
CAF195 FLSEEAADNPDRLSKDRVWIIDPLDGTKEFATGRQDWAVHIALVENGVPTHAAVGLPDLG
               70        80        90       100       110       120

              130       140       150       160       170       180
./wwwt WFHSADARAVTGPYSKVIAISHNRPPKVALSCAEQLGFETKALGSAGAKAMHVLLGDYD
CAF195 WFHSADARAVTGPYSKVIAISHNRPPKVALSCAEQLGFETKALGSAGAKAMHVLLGDYD
              130       140       150       160       170       180

              190       200       210       220       230       240
./wwwt AYIHAGGQYEWDSAAPVGVCKAAGLHCSRLDGSELTYNNKDTYMPDILICRPELADELLE
CAF195 AYIHAGGQYEWDSAAPVGVCKAAGLHCSRLDGSELTYNNKDTYMPDILICRPELADELLE
              190       200       210       220       230       240

              250
./wwwt MCAKFYEENGTY
CAF195 MCAKFYEENGTY
              250
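
The "% identity" figure reported with an alignment like the one above can be recomputed from the two aligned rows. The sketch below is one common definition (identical positions divided by aligned columns, ignoring columns that are gaps in both rows); it is an illustrative assumption and not necessarily the exact formula used by the ALIGN program.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent identity between two rows of a pairwise alignment.

    Assumption: identity = identical residue pairs / aligned columns,
    where columns that are gaps in both rows are skipped.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b and a != gap:
            matches += 1
    if compared == 0:
        raise ValueError("no aligned columns to compare")
    return 100.0 * matches / compared

if __name__ == "__main__":
    # Last block of the alignment above (positions 241-252).
    print(percent_identity("MCAKFYEENGTY", "MCAKFYEENGTY"))
```

Applying the function to each block, or to the concatenated rows, gives the identity for the whole alignment.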
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lalign output for SEQ ID NO:6 vs. BAB98238 Page 1 of 1 

lalign output for SEQ ID NO:6 vs. BAB98238 
[ISREC-Server] Date: Mon Jun 26 19:35:51 Europe/Zurich 2006 




./wwwtmp/lalign/.12164.1.seq : 252 aa

ALIGN calculates a global alignment of two sequences
version 2.0u Please cite: Myers and Miller, CABIOS (1989) 4:11-17
SEQ ID NO: 6 252 aa vs.
BAB98238 252 aa
scoring matrix: BLOSUM50, gap penalties: -14/-4
100.0% identity; Global alignment score: 1703

               10        20        30        40        50        60
./wwwt MTAQIDDSILTHRLAQGTGEILKGVRNVGVLRGRNLGDAGDELAQSWIARVLEQHRPNDG
BAB982 MTAQIDDSILTHRLAQGTGEILKGVRNVGVLRGRNLGDAGDELAQSWIARVLEQHRPNDG
               10        20        30        40        50        60

               70        80        90       100       110       120
./wwwt FLSEEAADNPDRLSKDRVWIIDPLDGTKEFATGRQDWAVHIALVENGVPTHAAVGLPDLG
BAB982 FLSEEAADNPDRLSKDRVWIIDPLDGTKEFATGRQDWAVHIALVENGVPTHAAVGLPDLG
               70        80        90       100       110       120

              130       140       150       160       170       180
./wwwt WFHSADARAVTGPYSKVIAISHNRPPKVALSCAEQLGFETKALGSAGAKAMHVLLGDYD
BAB982 WFHSADARAVTGPYSKVIAISHNRPPKVALSCAEQLGFETKALGSAGAKAMHVLLGDYD
              130       140       150       160       170       180

              190       200       210       220       230       240
./wwwt AYIHAGGQYEWDSAAPVGVCKAAGLHCSRLDGSELTYNNKDTYMPDILICRPELADELLE
BAB982 AYIHAGGQYEWDSAAPVGVCKAAGLHCSRLDGSELTYNNKDTYMPDILICRPELADELLE
              190       200       210       220       230       240

              250
./wwwt MCAKFYEENGTY
BAB982 MCAKFYEENGTY
              250
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cysQ, a Gene Needed for Cysteine Synthesis in Escherichia coli 

K-12 Only during Aerobic Growth 
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The initial steps in assimilation of sulfate during cysteine biosynthesis entail sulfate uptake and sulfate 
activation by formation of adenosine 5'-phosphosulfate, conversion to 3'-phosphoadenosine 5'-phosphosulfate,
and reduction to sulfite. Mutations in a previously uncharacterized Escherichia coli gene, cysQ, which resulted
in a requirement for sulfite or cysteine, were obtained by in vivo insertion of transposons Tn5tac1 and Tn5supF
and by in vitro insertion of resistance gene cassettes. cysQ is at chromosomal position 95.7 min (kb 4517 to 4518)
and is transcribed divergently from the adjacent cpdB gene. A Tn5tac1 insertion just inside the 3' end of cysQ,
with its isopropyl-β-D-thiogalactopyranoside-inducible tac promoter pointed toward the cysQ promoter,
resulted in auxotrophy only when isopropyl-β-D-thiogalactopyranoside was present; this conditional phenotype
was ascribed to collision between converging RNA polymerases or interaction between complementary 
antisense and cysQ mRNAs. The auxotrophy caused by cysQ null mutations was leaky in some but not all E. 
coli strains and could be compensated by mutations in unlinked genes. cysQ mutants were prototrophic during 
anaerobic growth. Mutations in cysQ did not affect the rate of sulfate uptake or the activities of ATP sulfurylase 
and its protein activator, which together catalyze adenosine 5'-phosphosulfate synthesis. Some mutations that 
compensated for cysQ null alleles resulted in sulfate transport defects. cysQ is identical to a gene called amtA, 
which had been thought to be needed for ammonium transport. Computer analyses, detailed elsewhere, 
revealed significant amino acid sequence homology between cysQ and suhB of E. coli and the gene for 
mammalian inositol monophosphatase. Previous work had suggested that 3'-phosphoadenoside 5'-phosphosulfate
is toxic if allowed to accumulate, and we propose that CysQ helps control the pool of 3'-phosphoadenoside
5'-phosphosulfate, or its use in sulfite synthesis.



The cysteine biosynthetic pathway (Fig. 1), a principal 
route of sulfur assimilation, involves more than 15 genes in 
at least five chromosomal regions in Escherichia coli and 
Salmonella typhimurium. It has been studied since the early 
days of physiological genetics in order to elucidate the roles 
of the individual genes, the control of their expression, and 
how the flow of metabolic intermediates is regulated (for a 
review, see reference 25). The transcription of most cys 
genes is positively controlled by the protein product of cysB 
and its coinducer, O-acetyl serine (also a cysteine precur- 
sor), during aerobic growth; transcription is repressed by 
sulfide, which is generated by reversal of the final biosyn- 
thetic step (Fig. 1). CysB seems not to be needed during 
anaerobic growth (3). The cysQ gene described here is also 
needed only during aerobic growth. It is inferred to act 
before sulfite formation, and hence this early part of the 
cysteine pathway is reviewed briefly below. 

The initial step, sulfate uptake, is mediated by a permease 
encoded by the cysT, cysW, and cysA genes, which, along
with cysP, constitute one operon (49). CysP protein is 
needed for maximal thiosulfate and sulfate binding, but it is 
probably not part of the permease, and its role in cysteine 
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biosynthesis is unclear (19). A cysZ gene, about 10 kb from 
the cysPTWA operon, may also be needed for sulfate uptake 
(41). Intracellular sulfate is activated via synthesis of aden- 
osine 5'-phosphosulfate (APS) by ATP sulfurylase, which is 
encoded by cysD and cysN (34). This activation step is 
complex, in that the rate of APS formation is greatly 
enhanced both by a protein activator (31, 32) and by GTP 
hydrolysis (33). APS is converted to 3'-phosphoadenosine 
5'-phosphosulfate (PAPS) by APS kinase, encoded by cysC. 
This step is thought to not require cofactors, because APS 
kinase activity does not change during enzyme purification 
(46). Sulfite is generated from PAPS in a complex reaction 
involving transfer and reduction of its sulfuryl moiety. This 
reaction is catalyzed by the cysH gene product, PAPS
sulfotransferase, and involves a thioredoxin- or glutaredoxin-bound
intermediate (51, 52).

Strains with mutations in cysH or in both trxA and grx 
(encoding thioredoxin and glutaredoxin, respectively) grow 
poorly. The poor growth can be corrected by additional 
mutations in cysC (APS kinase) or genes for earlier steps in 
the pathway (16, 45a), a result indicating that PAPS or one of 
its derivatives is toxic if allowed to accumulate. We find this 
result interesting in the context of understanding mecha- 
nisms by which organisms cope with the many metabolic 
intermediates that are both essential for healthy growth and 
potentially deleterious. Other studies have shown that the 
activities of ATP sulfurylase and APS kinase decrease 
rapidly when growth is slowed (25, 26). Such instability 
could help modulate metabolite flow through this pathway 
and would be more sensitive to decreased need for PAPS 
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FIG. 1. Pathway of cysteine biosynthesis in E. coli (modified from data in references 25, 34, and 49). 



than any transcriptional regulation. The dependence of APS 
synthesis (and thereby PAPS synthesis) on the ATP sulfury- 
lase activator and on the local concentration of GTP (32, 33) 
might also help regulate PAPS levels. 

Mutations in the cysQ gene described here result in a 
requirement for cysteine or sulfite that is expressed only 
during aerobic growth and that is leaky in many but not all 
laboratory strains of E. coli. Our studies suggest that CysQ 
may help control the levels of PAPS, its localization, or its 
use in sulfite synthesis. 

MATERIALS AND METHODS 

Strains, media, and general methods. The bacterial strains 
and plasmids used in this study are listed in Table 1. Bacteria 
were grown in LN broth (5) or Vogel-Bonner glucose- 
minimal salts medium (54). An M9-based minimal salts 
medium, with ammonium acetate in place of ammonium 
chloride (21), was used where indicated. Solid media contained
1.5% Difco Bacto-Agar. Antibiotics were used at the
following concentrations: ampicillin, 250 µg/ml; kanamycin,
60 µg/ml; tetracycline, 12 µg/ml; streptomycin, 100 µg/ml;
and chloramphenicol, 20 µg/ml. Isopropyl-β-D-thiogalactopyranoside
(IPTG) was used at 0.5 mM. Amino acids were
added at 50 µg/ml except for glycine, which was added at 200
µg/ml. Standard procedures were used for bacterial growth,
characterization of auxotrophy and conjugation, DNA prep- 
aration, restriction endonuclease digestion, DNA electro- 
phoresis, recombinant DNA cloning, and transformation 
(12, 47). All enzymes were obtained from commercial 
sources (Life Technologies, Inc., Stratagene, New England 
BioLabs, or Boehringer Mannheim) and used as directed. 
Anaerobic (H2-CO2 atmosphere) conditions were obtained
by using BBL GasPak Anaerobic jars and the BBL GasPak 
plus system (Becton Dickinson and Co.). 

Assays. The activity of 2',3'-cyclic phosphodiesterase (encoded
by the cpdB gene) was measured as the release of
inorganic phosphate from cyclic UMP (4). Sulfate uptake
was measured as depletion of 35SO4 added to the medium
with cells grown with djenkolic acid as the sulfur source and
concentrated from the exponential phase (19). ATP sulfurylase
was measured as incorporation of 35SO4 into PAPS,
detected by thin-layer chromatography with dialyzed extracts
of cells grown with sulfite as the sulfur source and
induced with O-acetyl-L-serine (34). The level of the activator
of ATP sulfurylase was also measured as PAPS synthesis


Vol. 174, 1992 



cysQ GENE OF E. COLI 417



TABLE 1. Bacterial strains, phage, and plasmids

Strain, phage, or plasmid     Description or genotype                                              Source or reference

E. coli
  AJ2653 (a)                  ET8000 amtA (cysQ)::Tn10                                             21
  BW6458                      proC::Tn5 zje::Tn10-BJW43 metB1 relA1                                B. Wanner
  CAG5052                     Hfr btuB3191::Tn10, transfer counterclockwise from 7 min             48
  DB747                       W3350 gal rpsL sup0 (strain 594 of reference 7)                      Laboratory collection
  DB1434                      DB747 (λplac5 cI857 Sam7)                                            28
  DB4496                      MC1061 dam::Tn9 (p3)(pBRG1310)                                       43
  DB5463                      HfrH lacZ(Am) trp(Am) sup0                                           D. Botstein (DB6128)
  DB5508                      recD (p3)                                                            44
  DB5659                      DB747 cysQ::kan                                                      λcysQ::kan transduction
  DB6302 (a)                  MG1655 amtA (cysQ)::Tn10                                             P1 transduction from AJ2653
  DB6316                      MG1655 cysQ::kan                                                     λcysQ::kan transduction
  DB6908                      ET8000 cysQ::kan                                                     λcysQ::kan transduction
  DB6913                      TG1 cysQ::kan                                                        λcysQ::kan transduction
  DB6935                      DB5508 cysQ::Tn5supF                                                 This study
  DB7101                      TG1 Δ(cysD-N)::kan                                                   λΔ(cysD-N)::kan transduction
  DBan41                      594 with cysQ::Tn5tac1                                               Tn5tac1 transposition
  DBan41TR                    DBan41 with Δ(srl-recA)306::Tn10                                     P1 transduction from JC1289
  DK21                        sup0 dnaB(Am)266 (λimm21-ban P1)                                     29
  ET8000                      rbs lacZ::IS1 gyrA hutC k                                            37
  JC1289                      Δ(srl-recA)306 linked to Tn10                                        11
  MC1061                      F- araD139 Δ(ara-leu)7697 ΔlacX74 galU galK hsdR hsdM rpsL           8
  MG1655                      F- prototroph                                                        17
  TG1                         F' proAB+ traD36 lacIq lacZΔM15 supE hsdΔ5 thi Δ(lac-proAB)          47
  BW6164                      HfrRA2 thr::Tn10, clockwise transfer from 88 min                     55

Phages
  M13mp18                     Cloning vector                                                       38
  λ+                          λ wild type                                                          Laboratory collection
  λ::Tn5tac1                  λ Tn5tac1 b221 cI857 Oam29 Pam80                                     9
  λ419                        cysA+ (5F7 of reference 24)                                          23
  λ656                        cysQ+ (5B5 of reference 24)                                          23
  λΔ(cysD-N)::kan             Derivative of λ652 (6C8 of reference 24)                             28
  λcysQ::kan                  Derivative of λ656                                                   28
  λ656cysQ::Tn5supF                                                                                This study
  P1clr                                                                                            Laboratory collection

Plasmids
  pBRGan101                   Amp r, cysQ::Tn5tac1 (Kan r) cysQ+; pBR322 (SalI fragment from DBan41)
  pBRGan102                   Amp r, cysQ::Tn5tac1 (Kan r) cysQ+ cpdB+; pBR322 (ClaI fragment from DBan41)
  pBRGan103                   Amp r, cysQ+ cpdB+ ClaI Tn5tac1 Δ(0-52); pBRGan102, small ClaI deletion (Kan s)
  pBRGan104                   Amp r, cpdB+; pBRGan102, EcoRI deletion
  pBRGan110                   Amp r, cysQ+; pBR322 (SalI fragment of DB747)
  pBRGan111                   Amp r, cysQ+; pAN110, partial MspI digestion
  pBRGan111-1 to pBRGan111-5  Amp r, cysQ::cat; cat of pCM4 ligated into partial Sau3A of pBRGan111 (Fig. 2 and 4)
  pcysQ::kan                  Amp r, cysQ::kan; kan of pUC4K into EcoRI site of pBRGan111
  pCM4                                                                                             10
  pBR322                      Amp r Tet r                                                          6
  p3                          Kan r amp(Am) tet(Am)                                                29
  pBRG1310                    Tn5supF donor                                                        43

(a) The amtA gene is identical to cysQ, as detailed in the text.



in reactions containing purified ATP sulfurylase and dialyzed 
cell extracts as the source of ATP sulfurylase activator (31, 32). 

Genetic manipulation and analysis. Standard methods were
used for (i) mutagenesis of E. coli with Tn5tac1 with phage
λ::Tn5tac1 b221 cI857 Oam29 Pam80 as a transposon donor
(9) and (ii) Hfr conjugation and P1 generalized transduction
(48). Cysteine-requiring bacteria were tested for sensitivity
to azaserine and to chromate by spotting dilutions of these
agents on lawns of 10^7 bacteria spread on minimal glucose
agar supplemented with cysteine, djenkolic acid, or glutathione
and IPTG, as appropriate, and by growth in liquid
cultures with progressive twofold differences in the concentrations
of these agents.

To insert a transcription reporter into the cysQ gene, the
DNA of cysQ plasmid pBRGan111 was partially digested
with Sau3A, and full-length linear DNA was isolated after
electrophoresis in low-melting-point agarose and ligated to
the BamHI cat fragment from plasmid pCM-4 (10). The
religated DNA was used to transform the cysQ::Tn5tac1
strain DBan41, and plasmids that did not complement its 
cysteine auxotrophy were identified and characterized. 

To generate a cysQ::Tn5supF insertion mutant, Tn5supF
was transposed from the donor plasmid in strain DB4496 (43)
to cysQ+ phage λ656 (23, 24), and insertion-containing phage
were selected by plaque formation on the dnaB amber strain
DK21 (29, 43). Haploid Tn5supF-containing bacterial recombinants
were obtained by infecting strain DB5508 (which
contains amber mutant alleles of amp and tet genes) and
selecting Sup+ transductants by their resistance to ampicillin
or tetracycline (44). Sup+ transductants were screened for
auxotrophy. To generate a cysQ::kan mutant, an EcoRI kan
cassette from plasmid pUC4K was ligated into the EcoRI
site in cysQ of pBRGan111. This allele was recombined into
λ656 by infecting cells carrying the pBRGan111-cysQ::kan
plasmid and selecting phage carrying the cysQ::kan allele by
transduction of DB1434. λcysQ::kan phage recovered from
the lysogen were used to transduce nonlysogens and thereby
obtain haploid cysQ::kan bacteria (28).

Cys + revertants of cysQ mutant strains were obtained by 
growing young single-colony isolates in 2 ml of LN broth to 
stationary phase, washing the cells twice with 10 mM 
MgSO4, plating aliquots on minimal (cysteine-free) medium,
and incubating for 2 days at 37°C. Reversion frequencies 
were measured by using several cultures from different 
single colonies to avoid jackpots. 
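
Measuring several independent cultures matters because a single culture in which a revertant arose early (a "jackpot") can inflate the apparent frequency. The sketch below illustrates one way such data might be summarized. The colony counts are invented for illustration, and reporting the median across cultures is an assumption made here, not a procedure stated in the text.

```python
from statistics import median
from typing import List, Tuple

def reversion_frequency(revertant_colonies: int, viable_cells_plated: float) -> float:
    """Cys+ revertants per viable cell plated for one culture."""
    if viable_cells_plated <= 0:
        raise ValueError("viable_cells_plated must be positive")
    return revertant_colonies / viable_cells_plated

def summarize_cultures(cultures: List[Tuple[int, float]]) -> float:
    """Median frequency across independent cultures.

    The median is used so that one jackpot culture does not dominate.
    """
    return median(reversion_frequency(r, n) for r, n in cultures)

if __name__ == "__main__":
    # Hypothetical data: (revertant colonies, viable cells plated) per culture.
    # The fourth culture plays the role of a jackpot.
    example = [(12, 1e8), (9, 1e8), (15, 1e8), (480, 1e8), (11, 1e8)]
    print(summarize_cultures(example))
```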

DNA sequence analysis. A 1-kb segment containing the
cysQ gene was sequenced by the Sanger dideoxynucleotide
chain termination method with Sequenase (U.S. Biochemical,
Cleveland, Ohio) and single- and double-stranded DNA
templates (27). Primer binding sites were provided by insertions
of transposons Tn5tac1 and Tn5supF in phage λ656, by
the promoterless cat gene in plasmid pBRGan111 DNAs,
and by a universal primer binding site in M13mp18 (38) (for
sequencing an EcoRI-PstI fragment containing the 3' end of
cysQ).

The oligonucleotides used as sequencing primers are as
follows: (i) 5' CTCCATTTTAGCTTCCTTAGCTCC, positions
40 through 17 at the 5' end of the cat gene cassette; (ii)
5' TGTCAAAACATGAGAATTCCTCCCG, positions 43
through 20 near the I end of Tn5tac1; (iii)
5' GGAAACAGAATTCCCGGGGATCCCC, positions 4549 through 4573 near
the O end of Tn5tac1; (iv) 5' TAGGATCCCCTACTTGTGTA,
positions 30 through 11 near the O end of Tn5supF; (v)
5' TAGGATCCCGAGATCTGATC, positions 236 through
255 near the I end of Tn5supF; (vi)
5' GAGCGGCCAAAGGGAGCAGAC, positions 139 through 159 (middle
primer) within Tn5supF with its 3' end toward the I end; (vii)
5' GTAAAACGACGGCCAGT, the universal M13 sequencing
primer.
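
Because each primer is listed with both a sequence and a coordinate range, the two can be cross-checked: the number of bases should equal the number of positions spanned. The sketch below does this for primers (i) through (vi), using the sequences as transcribed from the list above with spacing removed. The data structure and function names are illustrative; a disagreement would suggest a transcription or typesetting error rather than anything about the experiment.

```python
from typing import List, Tuple

# (label, sequence 5'->3', first position, last position) from the primer list.
# A descending range means the primer reads against the numbered strand.
PRIMERS: List[Tuple[str, str, int, int]] = [
    ("i", "CTCCATTTTAGCTTCCTTAGCTCC", 40, 17),
    ("ii", "TGTCAAAACATGAGAATTCCTCCCG", 43, 20),
    ("iii", "GGAAACAGAATTCCCGGGGATCCCC", 4549, 4573),
    ("iv", "TAGGATCCCCTACTTGTGTA", 30, 11),
    ("v", "TAGGATCCCGAGATCTGATC", 236, 255),
    ("vi", "GAGCGGCCAAAGGGAGCAGAC", 139, 159),
]

def positions_spanned(first: int, last: int) -> int:
    """Number of positions in an inclusive range, in either direction."""
    return abs(last - first) + 1

def length_mismatches(primers: List[Tuple[str, str, int, int]]) -> List[str]:
    """Labels of primers whose length differs from their coordinate span."""
    return [
        label
        for label, seq, first, last in primers
        if len(seq) != positions_spanned(first, last)
    ]

if __name__ == "__main__":
    for label, seq, first, last in PRIMERS:
        print(label, len(seq), positions_spanned(first, last))
    print("check these entries:", length_mismatches(PRIMERS))
```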

Nucleotide sequence accession number. The nucleotide 
sequence of the cysQ gene shown in Fig. 4 has been 
deposited with GenBank under accession number M80795. 

RESULTS 

Initial detection and characterization of cysQ. The prototrophic
strain E. coli DB747 was mutagenized with
Tn5tac1, a transposon with an outward-facing tac promoter
that is regulated by the lac repressor and IPTG (9). A
conditional mutant that required cysteine for growth on
minimal medium containing IPTG, but not on medium lacking
IPTG, was isolated and named DBan41. Early characterizations
of this strain revealed two other novel features. It
did not require cysteine for normal growth in an anaerobic
atmosphere. In addition, it formed slow-growing colonies on
cysteine-free medium containing IPTG (after 2 to 3 days, 
instead of 16 h in the case of its Cys + parent). Cells in these 
colonies exhibited the same slow-growth phenotype, which 
indicated that the mutation was leaky, not highly revertible. 
The addition of IPTG to DBan41 in cysteine-free liquid 
medium lengthened the cell doubling time from about 80 min 
to 280 min. 
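
To put the change in doubling time in terms of growth rate, the two
values can be converted with the standard relation mu = ln(2)/t_d. This
is a minimal sketch that assumes simple exponential growth in both
conditions (an assumption, not something measured here); the function
names are hypothetical.

```python
import math


def specific_growth_rate(doubling_time_min: float) -> float:
    """Specific growth rate (per minute), assuming exponential growth: ln(2) / t_d."""
    return math.log(2) / doubling_time_min


def fold_slowdown(t_before_min: float, t_after_min: float) -> float:
    """Factor by which the doubling time lengthened."""
    return t_after_min / t_before_min


# Doubling times quoted in the text for DBan41 without and with IPTG.
mu_without_iptg = specific_growth_rate(80.0)
mu_with_iptg = specific_growth_rate(280.0)

if __name__ == "__main__":
    print(f"mu without IPTG: {mu_without_iptg:.5f} per min")
    print(f"mu with IPTG:    {mu_with_iptg:.5f} per min")
    print(f"fold slowdown:   {fold_slowdown(80.0, 280.0):.1f}")
```

Under that assumption, the growth rate with IPTG is 80/280 of the rate
without it, i.e., a 3.5-fold slowdown.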

The cysteine requirement of DBan41 was satisfied by
sulfite at 0.3 mM, which indicated a defect in the sulfate
assimilation branch of the pathway (Fig. 1). The strain was
as sensitive to chromate (MIC, 50 to 100 µM) as its wild-type
parent and also grew on low concentrations of thiosulfate (1
to 2 mM). Sulfate uptake mutants are chromate resistant,
and many are deficient in thiosulfate uptake (14, 40);
therefore, this mutation seemed to affect a step leading to
sulfite that follows sulfate uptake (Fig. 1).

The mutation was mapped by genetic and molecular
methods. (i) The cys+ allele was transferred efficiently by
Hfr strains BW6164 and CAG5052 to DBan41, which placed
the mutation in the 88- to 07-min interval of the E. coli
chromosome, far from other known cysteine biosynthetic
genes (1). (ii) The cys+ allele was cotransduced by phage P1
at a frequency of 1% with zje::Tn10, an insertion at 94 to 95
min (strain BW6458). (iii) The cys+ allele was also efficiently
transduced by λ656 (23, 24), a λ phage clone that carries the
segment of the E. coli chromosome from kb ~4511 to kb
~4525 (near 96 min). The mutant allele was recessive to the
wild type in partial diploids, which were formed by λ656
infection of a λ+ lysogenic derivative of DBan41 (44), as well
as in strains carrying the wild-type allele in multicopy
plasmids.

More refined map information came from molecular
cloning (Fig. 2). (i) Kanr plasmids obtained by cloning SalI- or
ClaI-digested DBan41 DNA in pBR322 contained an 18-kb
SalI fragment (pBRGan101) or an overlapping 14-kb ClaI
fragment (pBRGan102), respectively. (ii) A plasmid obtained
by cloning SalI-digested wild-type E. coli DNA
(pBRGan110) that complemented the cysQ::Tn5tac1 allele
contained a 14-kb fragment whose restriction map matched
that of the chromosome adjacent to the Tn5tac1 insertion.
(iii) A deletion plasmid that retained only 1.8 kb of
chromosomal sequence but retained Cys+ complementing activity
was generated by partial MspI digestion of pBRGan110
DNA (pBRGan111). Comparisons of the restriction digest
patterns of these clones with the known restriction map of
the chromosomal region near 96 min (24, 36) indicated that
Tn5tac1 was at kb 4517, about 900 bp upstream of the cpdB
gene. Transcription from the tac promoter in Tn5tac1 was
toward cpdB (clockwise).

Cys- phenotype not caused by cpdB overexpression. The
CpdB protein has a 3'-nucleotidase activity that can degrade
PAPS to APS in vitro (4). Although CpdB protein seems to
be primarily periplasmic, findings of cytoplasmic inhibitors
for other periplasmic nucleotidases (36) suggested models in
which CpdB also acted intracellularly. Thus, in principle,
transcription from the tac promoter might cause a cysteine
requirement by increasing cpdB expression. Alternatively, it
might alter the expression of an unknown gene next to cpdB.
Three findings eliminated the simple CpdB-based model of
cysteine auxotrophy: (i) the CpdB level in strain DBan41
was increased less than 2-fold by IPTG (data not shown); (ii)
the multicopy cpdB+ plasmid (pBRGan104) did not cause a
cysteine requirement, although it did result in 10-fold higher
CpdB activity (data not shown); and (iii) the cysteine
requirement was complemented by plasmid pBRGan102,
which contains a chromosomal segment including cpdB and
the Tn5tac1 insertion. These results implied that the cysteine
requirement was due to altered expression of a previously
unknown gene next to cpdB. This gene was designated cysQ.

Direction of cysQ transcription. Insertions of a cat reporter
gene were made to determine the orientation of cysQ and
thereby to deduce whether auxotrophy resulted from
overexpression or underexpression of cysQ after IPTG-induced
transcription from the tac promoter in mutant strain
DBan41. Five different insertions into Sau3A sites of
plasmid pBRGan111 that inactivated Cys+ complementation
activity were isolated; restriction mapping showed that each
was within about 600 bp of the start of cpdB (Fig. 2).
Insertions 4 and 5, oriented away from cpdB (toward
Tn5tac1), conferred chloramphenicol resistance (25 µg/ml),
whereas insertions 1 and 3, in the opposite orientation, did
not. Insertion 2, also in the opposite orientation but closest
to cpdB, conferred weak resistance (~10 µg/ml). This was
attributed to a second promoter in cysQ that could allow
cpdB transcription (see sequence analysis, below). Based on
insertions 1, 3, 4, and 5, we inferred that cysQ is transcribed
toward Tn5tac1.

Phenotypes conferred by cysQ null alleles. Chromosomal
cysQ null mutations were generated and used to assess
whether the distinctive leaky auxotrophy and its correction
by anaerobic growth were allele or gene specific. (i) A
cysQ::Tn5supF insertion allele was obtained by selecting
transposition of Tn5supF to the cysQ+ phage λ656 (29, 43).
One of 50 Tn5supF insertions resulted in cysteine auxotrophy
when recombined into the E. coli chromosome, and
DNA sequencing (see below) showed that Tn5supF was
inserted in cysQ. (ii) A cysQ null allele marked with
kanamycin resistance was made by insertion of a kan gene at the
EcoRI site in pBRGan111 (Fig. 2), recombined from the
plasmid into λ656, and then recombined from the cysQ::kan
phage into bacterial chromosomes (28). (iii) A segment
containing part of both the cpdB and the cysQ genes was
deleted (ΔSnaBI; Fig. 2) to further test a possible
involvement of cpdB in the cysQ mutant phenotype. This deletion
was marked by insertion of kan and recombined into λ656
and from there into bacterial chromosomes. Finally, E.
Barnes kindly provided us with a fourth cysQ null allele,
which was generated by Tn10 (Tetr) insertion and was
originally designated amtA::Tn10 (21, 22).

Each of these four cysQ null alleles resulted in a cysteine
requirement that was leaky, corrected by anaerobiosis, and
satisfied by sulfite when transduced into several E. coli K-12
strain backgrounds, including DB747 (used to isolate the
original Tn5tac1 insertion), DB5508, and MG1655. These
alleles were much less leaky in two other laboratory strains:
TG1 and ET8000. The strain background determines the
leakiness of the cysQ mutant phenotype; cysQ derivatives of
DB747 and MG1655 showed some growth on cysteine-free
medium after 2 days at 37°C and small single colonies after 3
days, whereas the corresponding derivative of ET8000
showed barely perceptible growth only after 3 days (Fig. 3).
All cysQ mutant strains grew normally on sulfite or under
anaerobic conditions. The Cys- phenotype caused by the
Δ(cysQ cpdB) allele was identical to that caused by simple
insertions in cysQ in these strain backgrounds.

FIG. 3. Growth of cysQ::kan mutants and wild-type parents. Growth was for 2 or 3 days at 37°C on M9 salts glucose medium (no added
cysteine or sulfite). (Scattered white dots near the center of plate are a salt precipitate that often forms in M9 solid medium.) The numbers 1,
2, and 3 indicate strain backgrounds DB747, MG1655, and ET8000, respectively. -, cysQ::kan; +, Cys+.

The match between the null mutant phenotype and that of
the Tn5tac1 insertion indicated that IPTG-induced
transcription from Tn5tac1 shuts off the expression of cysQ quite
completely. The equivalence of Cys- phenotypes of the
Δ(cysQ cpdB) and the simple cysQ::kan insertion alleles
ruled out a model in which CpdB protein would consume
PAPS and in which CysQ protein would regulate this
consumption.

DNA sequence of cysQ. A 1-kb segment containing the sites
of insertion mutations that defined cysQ was sequenced by
using primer binding sites provided by the Tn5tac1 and
Tn5supF transposons, the kan and cat insertions, and, for
one segment, an M13mp18 vector. All portions of this
segment were sequenced on both DNA strands. The cysQ
DNA sequence corresponded to a 246-codon open reading
frame preceded by sequences that match consensus
transcription promoters and translation initiation sites (Fig. 4).
The first part of this sequence matched that found earlier in
the 0.5 kb upstream of cpdB in E. coli; analyses of S.
typhimurium indicated a cysQ homolog in the same location
in this species (35). The cysQ sequence we determined was
identical to that reported for amtA (15).
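
The size of the open reading frame can be restated in nucleotides with
simple codon arithmetic. The sketch below is illustrative only: the
function names are hypothetical, and it assumes that nucleotide 1 is the
first base of the initiation codon. It also converts a codon index, such
as codon 72 where the text reports the Tn5supF insertion, into the
coordinate of that codon's first base.

```python
def orf_length_nt(n_codons: int, include_stop: bool = True) -> int:
    """Length in nucleotides of an ORF with n_codons sense codons."""
    return 3 * (n_codons + (1 if include_stop else 0))


def codon_first_nt(codon_index: int) -> int:
    """1-based coordinate of the first base of a codon (codon 1 starts at nt 1)."""
    return 3 * (codon_index - 1) + 1


if __name__ == "__main__":
    print(orf_length_nt(246, include_stop=False))  # coding bases only
    print(orf_length_nt(246))                      # including the stop codon
    print(codon_first_nt(72))                      # first base of codon 72
```

With these assumptions, 246 codons correspond to 738 coding nucleotides,
or 741 nucleotides when the termination codon is counted.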

The DNA sequence confirmed that transcription of cysQ
should diverge from that of cpdB, as suggested by cat
reporter insertions. Tn5tac1 was inserted within cysQ, just
two codons from its 3' end (a fusion protein with 17 
additional amino acids is predicted). Tn5supF was inserted 
at codon 72 of cysQ. Only 17 bp separate the -35 regions of 
putative promoters for cysQ and cpdB, and an apparent 
consensus cyclic AMP receptor protein (CRP) binding site 
(placed such that CRP binding could stimulate cpdB tran- 
scription; see Fig. 4 legend) (35) overlaps the putative cysQ 
promoter. The low-level chloramphenicol resistance associ- 
ated with the cat insertion 2 is attributable to an additional 
promoter within cysQ (nucleotides 72 to 46, underlined in 
Fig. 4). (This promoter might also explain the relative 
weakness of the CRP-cyclic AMP dependence of cpdB 
expression [35].) Binding sites for the CysB positive regula- 
tory protein, found in the promoter regions of most cysteine 
biosynthetic genes (18), did not seem to be present in the 
cysQ promoter region. A search of the PROSITE data base 
of protein motifs (2) did not reveal significant matches to 
CysQ. However, we recently found strong amino acid se- 
quence level homologies between cysQ and suhB of E. coli 
(Fig. 5), mutations in which suppress certain rpoH missense 
alleles (apparently by elevating the levels of heat shock 
sigma subunit of RNA polymerase that it encodes [56]) and 
also between cysQ and genes for several eukaryotic pro- 
teins, including inositol monophosphatase (13, 39). 

Does cysQ participate in ammonium uptake? While
preparing this manuscript, we learned that cysQ corresponds to the
gene called amtA (21, 22), a designation based on the finding
of a Tn10 insertion mutation (amtA::Tn10) that blocked
growth on minimal (cysteine-free) medium containing very
low levels of ammonium (<0.1 mM, rather than the >10 mM
used in most media). It was proposed that amtA is needed
for active ammonium uptake (21). Our reconstruction
experiments showed, however, that even cysQ+ (amtA+) bacteria
grew very poorly on low-ammonium medium (Fig. 6).
Strains with the amtA::Tn10 or cysQ::kan insertion
mutations failed to form colonies on this medium, as reported
earlier (21). The mutants did grow, however, when this
medium was supplemented with cysteine (Fig. 6) or sulfite,
in which case these mutant strains were indistinguishable
from their Cys+ parents in colony size. Since neither sulfite
nor cysteine can be used as an ammonium source by E. coli
(53), these results are not consistent with the interpretation
(21) that the AmtA (CysQ) protein is needed for ammonium
uptake.

FIG. 4. DNA sequence of the cysQ gene. Sites of insertion and orientations (in cases of transposons) are indicated, as are putative
promoters for cysQ and cpdB and the consensus CRP binding site. The CRP binding site identified in ref. 35 extends from positions -72 to
-96, relative to the start of cysQ translation (beginning at TT in the -35 region of the cysQ promoter). The DNA sequencing protocols and
sequences of the oligonucleotide primers used are given in Materials and Methods.



Possible roles for cysQ. We tested whether cysQ might be 
needed in sulfate uptake or activation. No effect of cysQ null 
mutations was found on the rapid uptake of labelled sulfate 
from the medium in the background of strain TG1 or DB5463
(nonleaky and leaky cysteine requirements, respectively). In 
contrast, a cysDN (ATP sulfurylase) deletion strain was 
severely deficient in sulfate uptake, as expected (14) (data 
not shown). No significant differences between cysQ mutant 
and parental strains were detected in levels of ATP sulfury- 
lase, which catalyzes synthesis of APS, the first activated 
sulfur intermediate. cysQ mutations also had no effect on the 
level of the activator of ATP sulfurylase (data not shown). 

Reversion of cysQ null mutants. Cys+ revertants of
cysQ::kan and cysQ::Tn10 mutants were obtained at
frequencies of about 10^-6 with derivatives of TG1 and ET8000
(nonleaky cysQ mutant phenotype) and >10^-5 with
derivatives of DB747 and MG1655 (leaky cysQ mutant phenotype);
these differences in recovery probably reflect the greater
leakiness of cysQ mutations in the DB747 and MG1655
backgrounds. The revertants were heterogeneous in colony
size on cysteine-free medium but grew as well as their cysQ+
ancestors on cysteine-containing medium and retained the
Kanr or Tetr traits of their Cys- parents. The parental cysQ
mutant alleles were recovered from several revertants by
transduction and selection for the appropriate resistance
trait (Kanr or Tetr). Two spontaneous reversion mutations
that allowed relatively good growth on cysteine-free medium
were mapped in Hfr x F- crosses and then by transduction
with several candidate λ phage clones (marked by insertion
of a Tn5cam transposon [44, 50]). These reversion mutations
were found to be in the segment carried by λ419, a phage
clone that also carries the cysPTWA (sulfate binding and
uptake) operon. The two revertants tested were found to be
defective in sulfate uptake, unlike their cysteine-requiring
parents. Partial diploids generated by lysogenizing
revertants with λ419::Tn5cam and a λ+ helper required cysteine,
indicating that the reversion mutation is recessive and thus
probably due to loss of function. This Cys- phenotype was
unstable, however, because of frequent homogenotization
for the parental (nonrevertant) allele.

FIG. 5. Alignment of amino acid sequences of inferred protein products of cysQ and suhB (adapted from data in reference 39). Identities
are indicated by placements of conserved amino acids in the middle line; conservative substitutions (+) are indicated. Overlined segments
indicate regions with high sequence similarity to inositol monophosphatase (13). The CysQ (SuhB) initial amino acid alignment score of 229,
calculated by using the FASTA program (42), was 28 standard deviation units above the mean initial score of 24.6 for comparisons of CysQ
to the other sequences in the PIR (release 28) protein data base. In a test using the Dayhoff Relate program, there were four segments of 25
amino acids in length that were more than eight standard deviations above the mean. This indicates strong homology: the probability of getting
a single segment with such a deviation from a random sequence by chance alone is less than 10^-13 (42).
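
The similarity statistics quoted in the Fig. 5 legend (an initial FASTA
score of 229, a mean initial score of 24.6, and a separation of 28
standard deviation units) fix the standard deviation of the background
score distribution by simple arithmetic. The sketch below only restates
that arithmetic; the function names are hypothetical, and it does not
reproduce how FASTA itself computes its statistics.

```python
def z_score(score: float, mean: float, sd: float) -> float:
    """Number of standard deviations by which score exceeds mean."""
    return (score - mean) / sd


def implied_sd(score: float, mean: float, z_units: float) -> float:
    """Standard deviation implied by a score, a mean, and a stated z value."""
    return (score - mean) / z_units


if __name__ == "__main__":
    sd = implied_sd(229.0, 24.6, 28.0)
    print(f"implied SD of initial scores: {sd:.1f}")
    print(f"z recomputed from that SD:    {z_score(229.0, 24.6, sd):.1f}")
```

The implied standard deviation is (229 - 24.6)/28 = 7.3 score units.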

DISCUSSION 

The initial steps in the sulfate assimilation branch of the 
cysteine pathway entail sulfate uptake, its activation via 
formation of APS and conversion to PAPS, and then its 
reduction to sulfite (Fig. 1). The mutational and sequence 
analyses presented here identified a previously uncharacter- 
ized gene, cysQ, whose product is needed for proper meta- 
bolic functioning of this part of the pathway. We propose 
below that CysQ acts on PAPS. The cysQ gene was mapped
to a locus at ~96 min in the E. coli chromosome, which is far
from other cys genes. The cysQ promoter region overlapped 
a CAP binding site that is implicated in the control of 
expression of the adjacent gene, cpdB (35), and it did not 
contain a good match to the consensus binding site for the 
CysB regulatory protein (18). Hence, the expression of cysQ 
may be controlled differently from that of most other cys 
genes. The auxotrophy resulting from mutations in cysQ was 
leaky in some strain backgrounds and was compensated by 
mutations in other genes; cysQ mutants were prototrophic 
during anaerobic growth. 

A precedent for cysteine biosynthetic genes that are not 
needed during anaerobic growth is provided by cysI and cysJ
in S. typhimurium (3). This case reflects the presence of 
additional anaerobic sulfite reduction (asr) genes. E. coli 
lacks such asr genes (20), however, and our studies indicate 
that cysQ acts before, not after, sulfite formation (Fig. 1). 
cysB is also only needed during aerobic growth (3), suggest- 
ing that a separate transcriptional activator may be operating 
anaerobically. cysQ does not seem to be a transcriptional 
activator of cys genes, since it does not significantly affect 
the rate of sulfate uptake or the level of ATP sulfurylase or 
its protein activator. 

cysQ is identical to amtA, a gene which had been thought
to participate in ammonium transport (21). That
interpretation was based on a failure of mutant strains to grow on
low-ammonium cysteine-free medium or to take up
methylammonium at a high rate when they were grown with
arginine in place of ammonium (21). We found that the
growth defect of cysQ (amtA) mutants was compensated by
sulfite or cysteine (Fig. 6), neither of which serves as a
nitrogen source (53). The initial failure to recognize the
cysteine requirement of the amtA mutant (21) may have been
due to leakiness on normal minimal medium (Fig. 3) or
inadvertent selection of a (partially) compensating
suppressor mutation. The inability of cysQ (amtA) mutants to grow
on low-ammonium medium probably results from the
combined effects of partial starvation for ammonium (because of
the medium) and for cysteine (because of a mutation).
Although the inefficient induced methylammonium uptake
by amtA cells was also interpreted to reflect a specific uptake
defect, the reported data (21) indicate that the basal level of
uptake was not affected by the amtA mutation. Earlier work
had shown that induction by growth in arginine reflects the
slow release of ammonium from this source, relative to the
rate of ammonium consumption (45, 53). Because the poor
growth of cysQ (amtA) mutants on cysteine-free medium
should allow the arginine-derived ammonium to accumulate
to repressing levels, we do not find it necessary to postulate
a role for CysQ (AmtA) in ammonium or methylammonium
uptake.

FIG. 6. Comparison of wild-type and cysQ and amtA mutant bacterial strains on modified M9 minimal medium and either low (0.1 mM)
or normal (10 mM) concentrations of ammonium acetate. Strains: cysQ+, ET8000; cysQ mutant, ET8000 cysQ::kan (DB6908); amtA mutant,
ET8000 amtA::Tn10 (AJ2653) grown in medium containing 0.2 mM cysteine. The plate with 10 mM ammonium was incubated for 1 day and
the plate with 0.1 mM ammonium was incubated for 2 days at 37°C before being photographed. Identical growth patterns were obtained with
sulfite in place of cysteine, but only Cys+ parental strains grew on minimal low-ammonium medium lacking sulfite or cysteine. Equivalent
weak growth of mutant and Cys+ sibling strains on the low-ammonium medium was also observed with all other lineages tested (DB747 and
its cysQ derivative DB5656, MG1655 and its cysQ derivative DB6316, TG1 and its cysQ derivative DB6913). Growth was weaker on medium
containing 0.025 mM rather than 0.1 mM ammonium acetate, as expected, since ammonium was limiting. In the cases of Cys+ strains, this
slow growth was not stimulated by adding 10 or 20 mM cysteine (or sulfite or thiosulfate), whereas growth was stimulated by adding 20 mM
glutamate or arginine (which serve as ammonium sources [53]).

How does CysQ act in the synthesis of sulfite and
cysteine? Several possible roles have been eliminated by our
results to date: (i) sulfate uptake, (ii) stabilization of ATP
sulfurylase, (iii) synthesis or stabilization of the ATP
sulfurylase activator, and (iv) modulation of CpdB. In addition, the
cysQ sequence does not match that of ppa, the gene for
pyrophosphatase (30), an enzyme probably needed for
efficient APS synthesis (Fig. 1). The leakiness of cysQ-null
alleles in many strain backgrounds might reflect (i) a second
gene with a functionally related role; (ii) an intrinsic activity
of gene(s) that can mutate to give a Cys+ revertant
phenotype; or, if cysQ is regulatory, (iii) strain
background-dependent differences in the quantitative effects of CysQ on
the gene, protein, or metabolite that is the target of its
control.



Studies of Cys+ revertants are providing insights into how
CysQ may act. Several spontaneous reversion mutations
were mapped to a region that includes cysTWA permease
genes, were recessive to the wild-type (nonrevertant) alleles,
and were defective in sulfate uptake. These results suggested
that reversion results from loss of function, not from an
unusual expression of a silent or cryptic suppressor gene.
Accordingly, we have begun to isolate transposon insertions
that restore prototrophy to cysQ mutants in a nonleaky
background (50). One insertion that resulted in very small
colonies on cysteine-free medium was in cysA, which
encodes a subunit of sulfate permease (49). Transduction of
this insertion into a cysQ+ strain resulted in the same
small-colony phenotype, indicating that the phenotype
reflected loss of cysA function, not poor suppression of the
cysQ mutation. A second insertion, which resulted in
colonies of nearly normal size, was in cysP, a gene whose
product contributes to efficient sulfate and thiosulfate
binding (19). In interpreting these reversion data we draw on
early findings that mutations in cysH or in trxA plus grx
cause poor growth, apparently because accumulated PAPS
or a derivative of it is toxic, and that these mutations can be
compensated by mutations inactivating sulfate permease (16,
45a). Although the role of cysP in the cys pathway is not
understood, the ability of permease mutations to
compensate for the defect in cysQ suggests that CysQ also acts on
PAPS. Perhaps CysQ participates with the CysH
sulfotransferase to generate sulfite. Alternatively, perhaps CysQ
sequesters or consumes excess PAPS or a toxic derivative of
it. On this latter view, cysteine might be needed for growth
of cysQ mutants only to allow repression of cys gene
expression and thereby decrease PAPS synthesis, rather
than to compensate for a missing biosynthetic enzyme.
CysQ exhibits striking amino acid sequence homology to
mammalian inositol monophosphatase as well as to the
product of the suhB gene of E. coli (Fig. 5) (39). The
homology between CysQ and inositol monophosphatase, in
particular, encourages models in which CysQ acts on a
phosphorylated metabolite such as PAPS, possibly ensuring
that it plays its essential biosynthetic role without toxicity to
the cell.

Sinorhizobium sp. strain BR816 possesses two nodPQ copies, providing activated sulfate (3'-phosphoade-
nosine-5'-phosphosulfate [PAPS]) needed for the biosynthesis of sulfated Nod factors. It was previously shown 
that the Nod factors synthesized by a nodPQ double mutant are not structurally different from those of the 
wild-type strain. In this study, we describe the characterization of a third sulfate activation locus. Two open 
reading frames were fully characterized and displayed the highest similarity with the Sinorhizobium meliloti 
housekeeping ATP sulfurylase subunits, encoded by the cysDN genes. The growth characteristics as well as the 
levels of Nod factor sulfation of a cysD mutant (FAJ1600) and a nodP1 nodQ1 cysD triple mutant (FAJ1604)
were determined. FAJ1600 shows a prolonged lag phase only with inorganic sulfate as the sole sulfur source, 
compared to the wild-type parent. On the other hand, FAJ1604 requires cysteine for growth and produces 
sulfate-free Nod factors. Apigenin-induced nod gene expression for Nod factor synthesis does not influence the 
growth characteristics of any of the strains studied in the presence of different sulfur sources. In this way, it 
could be demonstrated that the "household" CysDN sulfate activation complex of Sinorhizobium sp. strain 
BR816 can additionally ensure Nod factor sulfation, whereas the symbiotic PAPS pool, generated by the nodPQ 
sulfate activation loci, can be engaged for sulfation of amino acids. Finally, our results show that rhizobial 
growth defects are likely the reason for a decreased nitrogen fixation capacity of bean plants inoculated with 
cysD mutant strains, which can be restored by adding methionine to the plant nutrient solution. 



Sulfur is a macronutrient that is required by all organisms. It 
forms constituents of proteins, lipids, carbohydrates, electron 
carriers, and numerous cellular metabolites. Sulfate is the most 
abundant source of utilizable sulfur in the aerobic biosphere. 
The sulfate assimilation complex, required for the formation of 
the sulfur-containing amino acid cysteine, has been the subject 
of intensive study in Escherichia coli (21). Cysteine is the cen- 
tral precursor of all organic molecules containing reduced sul- 
fur, ranging from the amino acid methionine to peptides, pro- 
teins, vitamins, cofactors such as S-adenosylmethionine, and 
hormones. 

Like all inorganic nutrients, sulfate is transported into cells 
by highly specific membrane transport systems (18). Sulfate 
assimilation requires its prior activation to adenylate com- 
pounds via a pathway that seems to be similar in all organisms. 
The activation is achieved by the ATP sulfurylase-catalyzed 
reaction of sulfate with ATP to give adenosine 5'-phosphosul- 
fate (APS), coupled with GTP hydrolysis. Subsequently, APS is 
phosphorylated by an APS kinase to produce 3'-phosphoade- 
nosine-5'-phosphosulfate (PAPS). In E. coli, ATP sulfurylase 
is encoded by cysD and cysN, whereas the APS kinase is en- 
coded by cysC (27, 28). PAPS is then enzymatically reduced by 
the cysH-encoded PAPS reductase (also known as PAPS
sulfotransferase) to sulfite, which enters the cysteine biosynthetic
pathway.
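
The E. coli activation route just described can be written down as a
small ordered data structure, which makes it easy to ask how far sulfate
gets when a given gene is inactive. This is only an illustrative sketch:
the names Step, PATHWAY, and last_reachable are hypothetical, and the
model makes the simplifying assumption that losing any one listed gene
abolishes the corresponding enzyme activity with no bypass.

```python
from typing import Iterable, NamedTuple, Tuple


class Step(NamedTuple):
    substrate: str
    product: str
    enzyme: str
    genes: Tuple[str, ...]


# Steps as described in the text for E. coli.
PATHWAY = [
    Step("sulfate", "APS", "ATP sulfurylase", ("cysD", "cysN")),
    Step("APS", "PAPS", "APS kinase", ("cysC",)),
    Step("PAPS", "sulfite", "PAPS reductase", ("cysH",)),
]


def last_reachable(inactive_genes: Iterable[str]) -> str:
    """Last compound formed from sulfate when the given genes are inactive.

    Simplifying assumption: a step fails if any of its genes is inactive.
    """
    inactive = set(inactive_genes)
    current = "sulfate"
    for step in PATHWAY:
        if inactive.intersection(step.genes):
            break
        current = step.product
    return current


if __name__ == "__main__":
    for genes in ([], ["cysD"], ["cysC"], ["cysH"]):
        print(genes, "->", last_reachable(genes))
```

In this toy model a cysC mutant stops at APS and a cysH mutant stops at
PAPS, which mirrors the order of intermediates given above.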

PAPS also serves directly as a sulfate donor for the
formation of sulfated compounds. For example, Rhizobium-legume
symbiotic interactions are mediated by a host-specific bacterial
signaling molecule (the Nod factor), which can be sulfated. In
general, rhizobial species that produce sulfated Nod factors
possess at least two sulfate activation systems (6, 12, 24, 25, 40).
The three genes that are indispensable for Nod factor
sulfation, nodP, nodQ, and nodH, were first isolated from
Sinorhizobium meliloti. Together, nodP and nodQ encode both ATP
sulfurylase and APS kinase activities (45, 47), whereas the
nodH gene product, a sulfotransferase, directly transfers the
activated sulfate moiety to the Nod factor backbone (8, 44).
NodP is homologous to E. coli CysD, while the amino- and
carboxy-terminal domains of NodQ are homologous to E. coli
CysN and CysC, respectively. In a recent study, it was reported
that the specificity of phytopathogen-host interactions also can
be controlled by a sulfated avirulence effector molecule, which
is yet to be identified (48). The rice pathogen Xanthomonas
oryzae pv. oryzae RaxP and RaxQ proteins are responsible for
the synthesis of an activated form of sulfate and are similar to
the NodP and NodQ host specificity proteins of the bacterial
symbiont S. meliloti.

In S. meliloti, two copies of the nodPQ operon are present.
Both copies are involved in Nod factor sulfation but are not
necessary for cysteine biosynthesis. Recently, in S. meliloti and
in Rhizobium tropici CFN299, homologues of the cysDN (ATP
sulfurylase) and cysH (APS reductase) genes were isolated, but
no homologue of the E. coli cysC gene (APS kinase) could be
identified (1, 23). Consequently, it was demonstrated that in S.
meliloti, APS rather than PAPS is reduced for sulfite
production during cysteine biosynthesis (1). Other members of the
Rhizobiaceae, differing in their ability to incorporate sulfate in
either a Nod factor or lipopolysaccharide, also preferentially
reduce APS instead of PAPS for cysteine biosynthesis. This
implies that APS reduction is not necessarily correlated with
the presence of PAPS-dependent sulfurylation reactions for
symbiosis, which is the case when functional nodPQ genes are
present (1). Recently, Kopriva et al. (20) have described a
phylogenetic classification of APS and PAPS reductase amino
acid sequences (both annotated as CysH) from different
organisms. The resulting sequence-based prediction of the
substrate specificities of these enzymes was confirmed by Williams
et al. (58), using genetic complementation experiments.

TABLE 1. Bacterial strains and plasmids

Strain or plasmid           Relevant characteristics                                                     Reference or source

Sinorhizobium sp. strains
  BR816                     Broad-host-range Sinorhizobium strain isolated from Leucaena leucocephala    16
  FAJ1600                   cysD mutant of BR816; Tcr                                                    This study
  FAJ1604                   nodP1 nodQ2 cysD triple mutant of BR816; Kmr Spr Tcr                         This study
  CFNE205                   nodP1 mutant of BR816; Kmr                                                   25
  CFNE206                   nodQ2 deletion mutant of BR816; Spr                                          25
  CFNE207                   nodP1 nodP2 double mutant of BR816; Kmr Spr                                  25
  CFNE208                   nodP1 nodQ2 double mutant of BR816; Kmr Spr                                  T. Laeremans, unpublished results

Plasmids
  pBRE4.8                   pUC19 carrying the BR816 cysDN genes; Apr                                    This study
  pJQ200uc1                 B. subtilis sacB-containing suicide vector; Gmr                              39
  pHP45Ω-Tc                 Vector containing Tcr cassette                                               38
  pUC18/19                  Cloning vector; Apr                                                          33

Sinorhizobium sp. strain BR816 (formerly Rhizobium sp. 
strain BR816) synthesizes Nod factors that are fully sulfated at 
the reducing terminal residue (50), as is the case for the nar- 
row-host-range S. meliloti (26). The sulfate decoration on the 
Nod factors secreted by S. meliloti is essential for nodulation of 
alfalfa (40). Except for S. meliloti, it is still unclear whether 
rhizobia producing sulfated Nod factors use only the nodPQ- 
dependent PAPS pool as a source of activated sulfate for Nod 
factor sulfation, the housekeeping PAPS pool, or both (25). 
Previously, Laeremans et al. (25) demonstrated that Sinorhi- 
zobium sp. strain BR816 possesses two nodPQ copies. Al- 
though both copies are functional, as demonstrated by genetic 
complementation of an R. tropici nodP mutant, the double 
mutants did not show any detectable changes in the amount of 
sulfated Nod factors produced by this strain (25). It was sug- 
gested that in Sinorhizobium sp. strain BR816, in contrast to S. 
meliloti, a housekeeping locus as a third PAPS-producing locus 
could be involved in the sulfation of the Nod factors. 

We have isolated the cysDN homologues of Sinorhizobium 
sp. strain BR816 and studied the role of this third PAPS- 
producing locus in relation to Nod factor synthesis. In addition, 
we were interested to know how the various forms of activated 
sulfate may be partitioned into the pathways for amino acid 
biosynthesis and sulfation or methylation of Nod factors and 
other compounds important during symbiosis. Furthermore, 
based on the analysis of the phylogenetic relationship among 
rhizobial ATP sulfurylases, we speculate on the possible origin 
and functionality of genes for sulfate activation. 



MATERIALS AND METHODS 

Bacterial strains and growth conditions. The bacterial strains and plasmids 
used in this study are listed in Table 1. E. coli strains were maintained on 
Luria-Bertani agar at 37°C and grown in Luria-Bertani broth (32). Rhizobial 
strains were maintained on yeast extract-mannitol medium (55) or on tryptone- 
yeast medium with added CaCl2 (3) at 30°C. Antibiotics were added to the 
medium as needed at the following concentrations (micrograms per milliliter): 
ampicillin, 100; spectinomycin, 50; kanamycin, 50; and nalidixic acid, 31. 
Tetracycline was added to a final concentration of 1 µg/ml (for Sinorhizobium sp. strain 
BR816) or 10 µg/ml (for E. coli). Triparental conjugations and site-directed 
mutagenesis were done as previously described (31). 

Nucleic acid manipulations and analysis. Isolation and cloning of plasmid 
DNA was performed as described previously (2, 42). Total genomic DNA of 
Sinorhizobium sp. strain BR816 was isolated by using a genomic DNA isolation 
kit (Gentra Systems) according to the manufacturer's instructions. DNA fragments 
were recovered from agarose gels by using the Nucleotrap kit (Macherey-Nagel). 
Southern blotting and hybridizations were carried out as previously 
described (25). Sequencing of DNA fragments cloned in the pUC18-pUC19 
vectors was performed on an automated ALF sequencer with fluorescein-labeled 
universal and synthetic oligonucleotide primers (Amersham Pharmacia Biotech, 
Uppsala, Sweden). Database searches for similarity were performed with the 
BLAST software (National Center for Biotechnology Information, National 
Institutes of Health). 

PCR was performed with Taq DNA polymerase (Boehringer, Mannheim, 
Germany) according to the manufacturer's protocol. For sequencing, the high- 
fidelity Platinum Pfx DNA polymerase (GIBCO-BRL, Life Technologies) was 
used according to the manufacturer's protocol. 

To construct a genomic minilibrary, total genomic DNA from Sinorhizobium 
sp. strain BR816 was digested with EcoRI. DNA fragments ranging between 4 
and 6 kb were recovered and ligated into the pUC19 cloning vector. Eight 
hundred Ap^r white colonies were picked. Plasmid DNA was purified from 15 
pools consisting of approximately 50 colonies, and efficient insertion of fragments 
of the desired size was confirmed. A 450-bp PCR fragment containing an internal 
part of cysD was used as a probe to screen the library. 

Phylogenetic analysis of CysD homologues. The amino acid sequences of 19 
CysD-like proteins, truncated to the same size as the shortest sequence (position 
3 to 299 from the S. meliloti NodP1 sequence [gi 14523565]) were aligned by using 
the ClustalW program (http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html). 
The construction of neighbor-joining trees (41) and bootstrap analysis 
of 1,000 resamples were performed by using the Treecon for Windows (1.3b) 
software package (53). In estimating evolutionary distances between amino acid 
sequences, we used the Poisson correction. Insertions and deletions were not 
taken into account. For constructing trees by the parsimony method, the 
PROTPARS program in the PHYLIP package was used (10). Again, bootstrap 
analysis of 1,000 resamples was performed. 
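
As an illustration of the distance estimate described above, the following minimal sketch computes a Poisson-corrected distance, d = -ln(1 - p), where p is the proportion of differing residues between two aligned sequences. It is not the Treecon implementation; the function name, the gap symbol, and the toy sequences in the usage note are assumptions made for this example. Positions with a gap in either sequence are skipped, in line with insertions and deletions not being taken into account.

```python
import math


def poisson_distance(seq_a, seq_b, gap="-"):
    """Poisson-corrected distance between two aligned protein sequences.

    Hypothetical helper for illustration; positions with a gap in either
    sequence are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # insertions and deletions are not taken into account
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    p = differences / compared  # uncorrected proportion of differences
    if p >= 1.0:
        return math.inf  # the correction is undefined when every site differs
    return -math.log(1.0 - p)
```

For two made-up aligned fragments such as "MKTAYIAKQR" and "MKTAHIAK-R", nine positions are compared and one differs, so p = 1/9 and the corrected distance is -ln(8/9).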

Growth tests. Growth tests of Sinorhizobium sp. strain BR816 in sulfate-free 
liquid medium were carried out in acid minimal salts (AMS) medium (36) 
containing 1 mM CaCl2 with sulfate salts replaced by equimolar amounts of 
alternative salts (MgCl2, ZnCl2, MnCl2, and CuCl2). Ammonium chloride (10 

FIG. 1. (A) Physical and genetic maps of the BR816 cysDN region. 
The triangle indicates the position of the inserted Tc^r cassette in the 
mutants FAJ1600 and FAJ1604. (B) Schematic view of the constructed 
Sinorhizobium sp. strain BR816 mutants with mutations in the nodPQ 
genes and cysDN genes (see Table 1). Triangles indicate inserted 
antibiotic resistance cassettes. 



mM) and mannitol (10 mM) were used as nitrogen and carbon sources, respectively. 
Sulfur compounds (sodium sulfate, sodium sulfite, L-cysteine, and L-methionine) 
were filter sterilized and added to the autoclaved medium at a concentration 
of 15 µM. When appropriate, cell cultures were induced with 500 nM 
apigenin. Cells of the strains tested were grown overnight in tryptone-yeast 
medium, washed twice in sulfate-free AMS medium, brought to an optical 
density of 0.4 (measured at 600 nm with a Perkin-Elmer lambda 2 spectrometer), 
and diluted 6,000-fold in sulfate-free AMS medium with the appropriate concentrations 
of filter-sterilized antibiotics, apigenin, and sulfur compounds. Bacteria 
were grown in microtiter plates (final volume, 300 µl) over a 4-day period, 
and cell growth was monitored automatically by measuring the optical density at 
600 nm in BioscreenC (Labsystems) every 30 min. For each time point, the 
average optical density was calculated from five independent measurements. 
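
The averaging step described above, together with one common way to estimate a generation time from such curves, can be sketched as follows. This is an illustrative sketch only: the function names and the OD window used to delimit exponential growth (0.05 to 0.4) are assumptions for this example, not values taken from the protocol.

```python
import math
from statistics import mean


def average_od(replicates):
    """Average OD600 across replicate measurements for each time point.

    `replicates` is a list of equally long OD series, one per measurement.
    """
    return [mean(list(values)) for values in zip(*replicates)]


def generation_time(times_h, od, lo=0.05, hi=0.4):
    """Doubling time (h) from a least-squares fit of ln(OD) against time.

    Only points with lo <= OD <= hi are used; this window is a hypothetical
    stand-in for the exponential growth phase.
    """
    points = [(t, math.log(y)) for t, y in zip(times_h, od) if lo <= y <= hi]
    if len(points) < 2:
        raise ValueError("not enough points in the exponential window")
    t_bar = mean([t for t, _ in points])
    y_bar = mean([y for _, y in points])
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in points)
    sxx = sum((t - t_bar) ** 2 for t, _ in points)
    slope = sxy / sxx  # specific growth rate per hour
    return math.log(2) / slope
```

With readings taken every 30 min, a prolonged lag phase shifts the curve along the time axis but leaves the fitted slope, and therefore the estimated generation time, essentially unchanged.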

Insertion mutagenesis. A Sinorhizobium sp. strain BR816 cysD single mutant 
and a nodP1 nodQ2 cysD triple mutant were constructed as follows. To obtain the 
cysD single mutant, the 1.6-kb SmaI fragment of pBRE4.8 was ligated into the 
SmaI site of pJQ200ucl. This vector allows positive selection of double homologous 
recombinants on sucrose (10%)-containing medium due to the presence of 
the Bacillus subtilis sacB gene. The resulting plasmid was digested with BamHI 
and then blunt-end ligated to the SmaI fragment containing the Ω-Tc^r cassette 
from pHP45Ω-Tc. This plasmid was conjugated to Sinorhizobium sp. strain 
BR816. Correct insertion of the Tc^r interposon was verified by Southern hybridization 
with the cysD gene and the Tc^r cassette as probes. In this way, the same 
construct was introduced in CFNE205 (nodP1), CFNE206 (nodQ2), CFNE207 
(nodP1 nodP2), and CFNE208 (nodP1 nodQ2) (Table 1; Fig. 1). A cysD single 
mutant (FAJ1600) and a nodP1 nodQ2 cysD triple mutant (FAJ1604) were 
obtained and retained for further analysis. 

Radioactive labeling of Nod metabolites and thin-layer chromatography 
(TLC) analysis. Nod factors were labeled by using the isotopes [14C]acetate and 
[35S]sulfate according to a slightly modified version of the protocol of Mergaert 
et al. (30), as previously described (25). For this experiment, Nod factors were 
purified from cells grown in sulfate-free AMS minimal medium supplemented 
with L-cysteine, as described for the growth tests. 

Plant nodulation assay. Seeds of Phaseolus vulgaris cv. BAT477 were surface 
sterilized and germinated as described previously (56). Bean seedlings were 
planted in 250-ml flasks containing a nitrogen-free Snoeck medium agar slant (C. 
Snoeck, J. Vanderleyden, and E. Schrevens, submitted for publication) with 
KH2PO4 (7.49 mM), K2SO4 (0.43 mM), CaCl2 (2.65 mM), MgCl2 (1.75 mM), 
MgSO4 (1.2 µM), FeNaEDTA (50.8 µM), MnSO4 (35.2 µM), CuSO4 (0.5 µM), 
ZnSO4 (1.5 µM), H3BO3 (25 µM), and (NH4)6Mo7O24 (0.07 µM), with sulfate 
as the sole sulfur source unless otherwise stated. The seedlings were inoculated 
with approximately 10^6 bacteria per plant, from a diluted overnight culture that 
was washed twice with sulfate-free AMS medium. The plants were maintained in 
a growth chamber at 26°C (day) and 22°C (night) with a 12-h photoperiod. Plants 
were harvested after 3 weeks. Uninoculated control plants did not show any 
nodules or nodule-like structures. Ten plants per strain were tested in each 
experiment. Nitrogenase activity was determined by measuring the acetylene 
reduction activity of nodulated roots in closed vessels with a Hewlett-Packard 
5890A gas chromatograph equipped with a PLOT fused silica column, with 
propane as an internal standard. 

Data analysis. In all experiments, a randomized block design was used with 10 
replicate blocks. Nodule number, nodule dry weight, and acetylene reduction 
activity were analyzed with the means and general linear model procedure (SAS 
Institute, Cary, N.C.). Comparison among the mean values obtained for each 
strain was made by Tukey's multiple-range test with a 95% confidence limit. 

Nucleotide sequence accession number. Nucleotide sequence data were de- 
posited in the GenBank database under accession number AJ505754. 

RESULTS 

Cloning and sequencing of a third PAPS-producing locus in 
Sinorhizobium sp. strain BR816. Previous work provided evi- 
dence for the presence of a putative third PAPS-producing 
locus in Sinorhizobium sp. strain BR816 on an approximately 
4.8-kb EcoRI genomic DNA fragment (25). In order to clone 
this third copy of sulfate activation genes, a genomic minili- 
brary was constructed (see Materials and Methods), and a 
single positive clone, pBRE4.8, was obtained. Since the in- 
serted genomic DNA region corresponding to the cysD gene 
was incomplete, the missing part of cysD was obtained by PCR 
with primers that were designed based on existing knowledge 
of the genomic organizations and DNA sequences of sulfate 
assimilation genes in other Rhizobium spp. (1, 23). 

A physical map of the 4.8-kb EcoRI fragment and the upstream 
442-bp PCR fragment was established (Fig. 1A), and 
the nucleotide sequence was determined. Similarity with an 
ATP sulfurylase encoded by the cysD and cysN genes of S. 
meliloti, R. tropici CFN299, and E. coli was found. Partial se- 
quence similarity upstream of the cysD gene revealed the pres- 
ence of a cysH homologue, encoding an APS or PAPS reduc- 
tase, whereas no cysC homologue was found in the sequenced 
fragment. The same organization is found in S. meliloti and R. 
tropici (1, 23). It is likely that all three open reading frames are 
in a single operon, since no promoter consensus sequences or 
transcription termination signals were found in the intergenic 
cysH-cysD sequence of BR816. A similar situation was observed 
in S. meliloti, where two transcriptional start sites were 
identified, both upstream of the cysH homologue (1). In con- 
trast, in E. coli, cysH does not form an operon with cysDNC 
(21). The nodP and nodQ homologues have a lower percent 
G+C content than the cysD and cysN homologues (data not 
shown), as observed for the S. meliloti genome (13). 
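
The percent G+C comparison mentioned above amounts to a simple base count. The following minimal sketch shows the calculation; the function name is hypothetical, and ambiguous bases (for example N) are excluded from the denominator, which is an assumption of this example rather than a statement about how the values in this study were obtained.

```python
def gc_percent(sequence):
    """Percent G+C of a nucleotide sequence, ignoring non-ACGT symbols."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)
```

Applying the same function to each open reading frame (nodP, nodQ, cysD, and cysN) would give directly comparable values.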

The Sinorhizobium sp. strain BR816 cysD and cysN genes 
encode proteins of 317 and 498 amino acids, respectively. 
Strong conservation of amino acid residues was found with the 
respective CysD and CysN proteins of S. meliloti (96 and 91% 
identity, respectively), R. tropici (89 and 82% identity), and E. 
coli (68 and 52% identity). CysN contains the characteristic 
GTP-binding motif (GxxxxGK, DxxG, and NKxD) (7) and also 
an ITI motif, which is conserved among elongation factors 
(19). In comparison to the NodQ peptides, the deduced amino 
acid sequence of cysN lacks the carboxy-terminal part that 
corresponds to E. coli CysC. Therefore, no ATP-binding or 
PAPS-binding motifs were found. Similar observations were 

FIG. 2. Schematic representation of sulfate assimilation loci of selected 
strains for construction of a phylogenetic tree (Fig. 3). Abbreviations: 
Sm, S. meliloti (NodP1, gi14523565; NodP2, gi15140612; 
CysD, gi5911360); Sbr, Sinorhizobium sp. strain BR816 (NodP1, 
gi2148989; NodP2, gi27125923; CysD, gi24528409); Rt, R. tropici 
CFN299 (NodP, gi1280528; CysD, gi7387610); Bm, Brucella melitensis 
(CysD, gi17988038); N33, Mesorhizobium sp. strain N33 (NodP, 
gi1531624); Ml, Mesorhizobium loti (NodP, gi13476292); Be, Bradyrhizobium 
elkanii (NodP, gi14209498); Ab, Azospirillum brasilense (NodP, 
gi142424); Ec, E. coli (CysD, gil ZS 17206); Ka, Klebsiella aerogenes 
(CysD, gi11992146); Xo, X. oryzae pv. oryzae (NodP, gi21105248); Mt, 
Mycobacterium tuberculosis (CysD, gi15608425); Sc, Streptomyces coelicolor 
(CysD, gi21224427); At, Agrobacterium tumefaciens (CysD, 
gi15155798). S, fully sulfated Nod factors; S/NS, mixture of sulfated 
and nonsulfated Nod factors; NS, nonsulfated Nod factors; APR, APS- 
reducing activity; PAPR, PAPS-reducing activity; ?APR, putative 
APR-reducing activity; ?, APS or PAPS reductase activity unknown; 
genome sequence not (fully) determined. Similar open reading frames 
are shaded identically. Note that nodP1 of S. meliloti is located on 
megaplasmid 1, nodP2 is on megaplasmid 2, and cysHDN is chromosomally 
located. nodP1 of Sinorhizobium sp. strain BR816 is located on 
a megaplasmid, nodP2 is on the symbiotic plasmid, and cysHDN is 
chromosomally located. 

made for S. meliloti and R. tropici. In summary, these data 
support the ATP sulfurylase activity of the putative proteins 
encoded by the isolated BR816 cysDN genes. 
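
The GTP-binding motif elements named above (GxxxxGK, DxxG, and NKxD, where x is any residue) can be located in a protein sequence with regular expressions. The sketch below is illustrative only: the function and dictionary names and the toy sequence in the usage note are assumptions, and the toy sequence is not the BR816 CysN sequence.

```python
import re

# Motif elements from the text, written as regular expressions (x = any residue).
GTP_MOTIFS = {
    "GxxxxGK": r"G.{4}GK",
    "DxxG": r"D.{2}G",
    "NKxD": r"NK.D",
}


def find_motifs(protein):
    """Return 1-based start positions of each motif element in `protein`."""
    protein = protein.upper()
    hits = {}
    for name, pattern in GTP_MOTIFS.items():
        # A lookahead is used so that overlapping occurrences are also reported.
        lookahead = "(?=(" + pattern + "))"
        hits[name] = [m.start() + 1 for m in re.finditer(lookahead, protein)]
    return hits
```

For the made-up sequence "MAGHVDHGKSTLDTAGQERNKMDLV", the three elements are found at positions 3, 13, and 20. Short patterns such as DxxG occur frequently by chance, so a hit is only suggestive unless all three elements appear in the expected order and spacing.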

Phylogenetic analysis of CysD and CysN homologues. The 
BR816 CysD and CysN ATP sulfurylase subunits were compared 
through multiple-sequence alignment (ClustalW) with 
homologous ATP sulfurylase subunits retrieved from GenBank. 
The genomic organizations of the different sulfate assimilation 
loci of the strains selected for the phylogenetic analysis 
are schematically drawn in Fig. 2. Phylogenetic analysis of 
cysD and nodP gene products by the protein parsimony method 
resulted in a maximum-parsimony tree, as shown in Fig. 3. An 
identical tree topology could be inferred by using the neighbor- 
joining method (data not shown). Similar phylogenetic rela- 
tionships could be deduced after construction of a phyloge- 
netic dendrogram of CysN and NodQ protein sequences by 
using either the neighbor-joining method or protein parsimony 
analysis (data not shown). 
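
The bootstrap support values reported for these trees rest on resampling alignment columns with replacement. A minimal sketch of that resampling step is given below; it does not build trees and is not the Treecon or PHYLIP implementation. The function name, the dictionary-based alignment format, and the fixed random seed are assumptions made for this example.

```python
import random


def bootstrap_alignments(alignment, n_replicates=1000, seed=0):
    """Yield pseudo-alignments made by sampling columns with replacement.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = lengths.pop()
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    for _ in range(n_replicates):
        columns = [rng.randrange(length) for _ in range(length)]
        # The same column indices are applied to every sequence.
        yield {
            name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()
        }
```

A tree would then be inferred from each pseudo-alignment, and the support for a cluster is the fraction of the replicate trees in which that cluster appears.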

It can be observed that the CysD and NodP ATP sulfurylase 
subunits of Rhizobium spp. producing sulfated Nod factors 

FIG. 3. Phylogenetic relationships among cysD gene products. The 
tree topology was inferred by using the protein parsimony method. 
Numbers represent the bootstrapping score (9) over 1,000 trials (par- 
simony/distance). The abbreviations of the species are as for Fig. 2. 



(Fig. 2), which have been shown to be involved in amino acid 
biosynthesis (1, 23) and Nod factor sulfation (24, 25, 47), respectively, 
cluster in two different groups (Fig. 3). The CysD 
protein of Sinorhizobium sp. strain BR816 clearly belongs to 
the protein cluster involved in biosynthesis of sulfur-containing 
amino acids, supporting its putative function. 

Two other "household" clusters could be distinguished, i.e., 
the γ-Proteobacteria clade and the Actinobacteria clade. Interestingly, 
only one gene copy coding for a sulfate activation 
complex has been described, for Mycobacterium tuberculosis 
(cysDNC) (Fig. 2) (58). The sulfate assimilation pathway of 
Mycobacterium tuberculosis proceeds from sulfate through APS 
(catalyzed by CysDN), which is converted by APS reductase 
(CysH) in the first step toward cysteine and methionine. APS 
can also be converted to PAPS, through the action of the APS 
kinase CysC, and serves as a substrate for sulfotransferases 
that produce sulfolipids, which putatively function as virulence 
factors (58). Similarly, APS and PAPS pools are generated 
through the enzymatic activity of RaxP and RaxQ in X. oryzae 
pv. oryzae and are used for both cysteine synthesis and sulfa- 
tion of avirulence effector molecules (48). 

The CysD-homologous proteins of some members of the 
Rhizobiaceae (among which are Mesorhizobium loti, producing 
nonsulfated Nod factors [29, 34]; Mesorhizobium sp. strain 
N33, producing sulfated Nod factors [35]; and the pathogen 
Brucella melitensis) seem to belong to another cluster. How- 
ever, these proteins are still more closely related to the NodP 
Nod factor sulfation cluster than to the CysD household clus- 
ter, as defined above. Brucella melitensis was previously shown 
to be genetically closely related to Rhizobium spp. (14). Intriguingly, 
the respective Bradyrhizobium elkanii and Azospirillum 
brasilense ATP sulfurylase subunits constitute a separate 
cluster (Fig. 3). The nodPQ genes of B. elkanii are situated 
within a gene cluster comprising genes for symbiotic functions 
(fixGHIS and noeE) as well as genes involved in rhizobitoxin 
biosynthesis (59). Since the B. elkanii Nod factors are not 
sulfated (4, 43), these genes do not function in Nod factor 
biosynthesis. The recently finished genome sequencing of the 
p90 plasmid of A. brasilense sheds new light on a possible 
function of its nodPQ copy, which is located within a region 
carrying genes involved in polysaccharide synthesis (E. Van- 
bleu and J. Vanderleyden, unpublished results). It was previ- 
ously shown that A. brasilense does not synthesize Nod factors 
and that deletion of the nodPQ copy does not lead to auxot- 
rophy (54). Therefore, it can be speculated that this cluster 
encompasses proteins belonging to a novel functionality group. 

Growth characteristics of Sinorhizobium sp. strain BR816 
cysD mutants under free-living conditions. To investigate the 
biochemical role of the isolated cysDN genes of BR816, the 
BR816 cysD gene was mutated (see Materials and Methods). 
First, the cysD mutants were tested for cysteine auxotrophy. In 
addition, we were interested to know whether a cysD mutation 
could be complemented by one or both nodP copies of Sino- 
rhizobium sp. strain BR816. Growth of the wild type and var- 
ious mutants with mutations in nodPQ and/or cysDN 
(FAJ1600, FAJ1604, CFNE205, CFNE206, CFNE207, and 
CFNE208) was examined in liquid sulfate-free AMS medium 
supplemented with various sulfur sources (see Materials and 
Methods). It could be demonstrated that the BR816 nodPQ 
single or double mutants (CFNE205, CFNE206, CFNE207, 
and CFNE208) exhibit growth patterns similar to that of the 
wild-type strain in minimal medium with sulfate as the sole 
sulfur source (data not shown). Therefore, it can be concluded 
that nodPQ mutants are not auxotrophs. Growth of the cysD 
mutant (FAJ1600) with sulfate as the sole sulfur source was 
clearly affected compared to that of the wild-type strain (Fig. 
4A). FAJ1600 showed a prolonged lag phase, although its 
generation time in exponential growth phase did not markedly 
differ from that of the wild type. The nodP1 nodQ2 cysD triple 
mutant (FAJ1604) was completely impaired in growth (Fig. 
4A). In the presence of sulfite, cysteine, or methionine, the 
growth of both mutants after 60 h was nearly restored to the 
wild-type level (Fig. 4B to D). This indicates that the cysDN 
genes are effectively involved in the biosynthesis of sulfur- 
containing amino acids, more specifically in the step of the 
sulfate assimilatory pathway just before the reduction of acti- 
vated sulfate to sulfite. From this experiment we can conclude 
that knocking out the three sulfate activation systems 
(FAJ1604) in Sinorhizobium sp. strain BR816 leads to cysteine 
auxotrophy. 
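
The growth phenotypes on sulfate described in this section can be summarized in a simple presence/absence model, sketched below. This is a toy model and not an analysis performed in this study. It assumes that each sulfate activation system needs both of its subunit genes, that a strain with an intact cysDN system grows like the wild type, and that a strain relying only on a nodPQ copy grows after a delay. The system labels "nodP1Q1" and "nodP2Q2" and the gene name nodQ1 are naming assumptions, and gene expression (for example, nod box dependence of nodP2) and polar effects are ignored.

```python
# Hypothetical labels for the three sulfate activation systems of BR816.
SYSTEMS = {
    "cysDN": ("cysD", "cysN"),
    "nodP1Q1": ("nodP1", "nodQ1"),
    "nodP2Q2": ("nodP2", "nodQ2"),
}

# Mutated genes per strain, taken from Table 1.
STRAINS = {
    "BR816": set(),
    "FAJ1600": {"cysD"},
    "FAJ1604": {"nodP1", "nodQ2", "cysD"},
    "CFNE205": {"nodP1"},
    "CFNE206": {"nodQ2"},
    "CFNE207": {"nodP1", "nodP2"},
    "CFNE208": {"nodP1", "nodQ2"},
}


def intact_systems(mutated):
    """Systems for which neither subunit gene is mutated."""
    return [
        name for name, genes in SYSTEMS.items()
        if not mutated.intersection(genes)
    ]


def predicted_growth_on_sulfate(strain):
    """Toy prediction of growth with sulfate as the sole sulfur source."""
    systems = intact_systems(STRAINS[strain])
    if "cysDN" in systems:
        return "wild-type-like"
    if systems:
        return "delayed"  # only a nodPQ copy remains to activate sulfate
    return "none"
```

Under these assumptions the model returns "delayed" for FAJ1600, "none" for FAJ1604, and "wild-type-like" for the nodPQ single and double mutants, which matches the observations summarized above.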

Interestingly, the growth characteristics of FAJ1600 showed 
a course similar to that of the wild type after a certain time 
interval. This demonstrates that the PAPS pool generated by 
the NodPQ sulfate activation complex is accessible for reduc- 
tion by CysH and thus is available for the biosynthesis of 
sulfur-containing amino acids. The growth delay of FAJ1600 
might indicate that CysH of Sinorhizobium sp. strain BR816 
preferentially shows APS reductase activity rather than PAPS 
reductase activity toward the formation of sulfite. Moreover, 
the APS reductase activity of CysH has been recently con- 
firmed in many rhizobial species (1, 20). 

One should consider that (i) the growth curves of the wild- 
type and mutant strains were monitored under conditions in 
which no Nod factors are produced (no flavonoid induction) 
and (ii) nodP2, which is localized in the nodulation region on 

FIG. 4. Effects of various sulfur sources on cell growth of Sinorhizobium 
sp. strain BR816 wild-type and mutant strains determined by 
measuring optical density at 600 nm (OD600) in a BioscreenC instrument 
over a 4-day period. Thick black line, BR816; gray line, FAJ1600; 
thin gray line, FAJ1604. Cultures were grown at 30°C in sulfate-free 
AMS medium supplemented with sodium sulfate (A), sodium sulfite 
(B), L-cysteine (C), or L-methionine (D) at a concentration of 25 µM. 
Each experiment was conducted three times. Results from one experiment 
are shown. 

FIG. 5. Autoradiogram of a reverse-phase TLC profile of butanol 
extracts of radioactively labeled Sinorhizobium sp. strain BR816 (A), 
FAJ1600 (B), and FAJ1604 (C). Lanes 1 and 2, 14C labeling; lanes 3 
and 4, 35S labeling. Lanes 1 and 3, noninduced; lanes 2 and 4, apigenin 
induced. Spots representing sulfated Nod factors are indicated with 
arrows. 



the symbiotic plasmid, probably is nod box dependent and thus 
not expressed (49). Therefore, to investigate whether the si- 
multaneous production of sulfated Nod factors affects growth 
characteristics of the cysDN mutant strains, similar growth tests 
were performed in the presence of the nod gene inducer api- 
genin and with sulfate as the sole sulfur source. In this case, 
similar growth courses were obtained for FAJ1600 and 
FAJ1604 compared to the wild type (data not shown). This 
implies that at least the expressed nodPQ copy can comple- 
ment and is sufficient for growth of FAJ1600 in minimal me- 
dium with sulfate as the sole sulfur source. The use of higher 
concentrations of inducer did not have a significant effect on 
the growth curves of the strains tested. 

Nod factor sulfation pattern of Sinorhizobium sp. strain 
BR816 cysD mutants. Since the available nodPQ single and 
double mutants of Sinorhizobium sp. strain BR816 (CFNE205, 
CFNE206, CFNE207, and CFNE208 [Table 1]) were not auxotrophic 
and still produced sulfated Nod factors, Laeremans et 
al. (25) speculated that the housekeeping cysDN(C) genes can 
complement mutations in genes responsible for Nod factor 
sulfation. In order to determine to what level the Nod factors 
produced by the wild-type strain and the mutant strains 
FAJ1600 and FAJ1604 were still sulfated, apigenin-induced 
cell cultures, grown in liquid sulfate-free AMS medium supplemented 
with cysteine, were labeled with [14C]acetate or 
[35S]sulfate, and butanol extracts of the cell cultures were analyzed 
by reverse-phase TLC. Separation of the BR816 Nod 
factors revealed the presence of apigenin-induced spots on the 
chromatogram, corresponding to the Nod factors of BR816 
(Fig. 5). The triple mutant FAJ1604 no longer produced sul- 
fated Nod factors, which is in clear contrast with the sulfated 
Nod factor pattern of both the wild-type strain and FAJ1600 
(Fig. 5). These results indicate that in FAJ1604 an activated sulfate source 
needed for Nod factor sulfation is no longer present. It can be 
concluded that the cysDN sulfate assimilation locus does provide 
activated sulfate for Nod factor sulfation. 

Symbiotic phenotype of cysD mutants. The Sinorhizobium 
sp. strain BR816 cysD mutants were tested for their ability to 

FIG. 6. Nodulation kinetics of P. vulgaris BAT477 inoculated with 
Sinorhizobium sp. strain BR816 wild-type and mutant strains. Two 
independent experiments were set up, and the results of one experi- 
ment are shown. 



nodulate common bean (P. vulgaris cv. BAT477) and to fix 
nitrogen. No significant differences in the kinetics of appear- 
ance of the first nodules were observed (Fig. 6). However, 
FAJ1600 (cysD) as well as FAJ1604 (nodP1 nodQ2 cysD) 
showed a decreased nodule number per plant over time, but 
only for FAJ1604 was this difference significant at the 95% 
level (Tukey's test). Morphologically, the nodules of both mu- 
tant strains were generally smaller with apparently less leghe- 
moglobin present (as judged by the absence of pink color). 

To study the nitrogen fixation capacity of the nodulated 
roots, the acetylene reduction activity was measured. The acetylene 
reduction activity of 21-day-old nodules induced by 
FAJ1600 or FAJ1604 was significantly lower than that for the 
wild-type strain (P < 0.05; Tukey's test) (data not shown). 
When methionine was added to the plant nutrient solution, the 
nitrogen fixation per plant was restored to wild-type levels. 
Interestingly, supplementation with methionine resulted in an 
overall higher nitrogen fixation capacity of P. vulgaris cv. 
BAT477 inoculated with Sinorhizobium sp. strain BR816 (data 
not shown). 

DISCUSSION 

In this study, a third APS-producing locus of the broad-host- 
range strain Sinorhizobium sp. strain BR816 was isolated. The 
nucleotide sequence of this region was determined, and based 
on homology searches, cysD and cysN were identified. As in 
S. meliloti, no cysC homologue could be isolated downstream 
from cysDN. This is an indication that, as in other rhizobia, 
APS rather than PAPS is reduced to sulfite for cysteine biosynthesis 
(1). The highest similarity was found with the cysDN 
homologues in S. meliloti, supporting the close phylogenetic 
relationship between S. meliloti and Sinorhizobium sp. strain 
BR816 (15). Phylogenetic analysis revealed that CysD does not 
cluster with NodP1 and NodP2. The two BR816 NodP proteins 
are closely related and could have originated from a recent 
gene duplication, as was proposed for the NodP proteins of S. 
meliloti (13). Within the α-Proteobacteria clade, two clusters of 
proteins are clearly functionally distinguished and were designated 
NodP Nod factor sulfation and CysD household. It has 
been demonstrated that the nodPQ genes are also required for 
sulfation of S. meliloti lipopolysaccharide, proving a dual functionality 
of members of the NodP Nod factor sulfation cluster 

FIG. 7. Schematic representation of the distribution of APS and 
PAPS for sulfation and methylation processes in Sinorhizobium sp. 
strain BR816. Dotted arrows indicate possible but less favorable en- 
zyme activity. 



(5, 17). A potential new NodP-like protein cluster is proposed, 
comprising proteins involved in sulfate activation for sulfation 
of compounds that are yet unknown but which could be important 
during symbiosis. Other closely related CysD and 
NodP homologues do not fit into a specific functionality group, 
since these proteins are involved either in sulfation of amino 
acids (M. loti and B. melitensis) or in sulfation of Nod factors 
(Mesorhizobium sp. strain N33). It should be noted that within 
the γ-Proteobacteria clade and the Actinobacteria clade, only 
one copy of genes encoding sulfate activating enzymes is 
present, which seems to be involved in biosynthesis of sulfur- 
containing amino acids as well as sulfation of other macromol- 
ecules. 

We examined the effect of a cysD mutation under free-living 
conditions in a wild-type chromosomal background and in a 
nodP1 nodQ2 double mutant background. The levels of Nod 
factor sulfation (Fig. 5) as well as the growth characteristics 
(Fig. 4) of the different mutants were determined. In this study, 
we could demonstrate that the household CysDN sulfate acti- 
vation locus of BR816 can additionally ensure Nod factor sul- 
fation, whereas the symbiotic (P)APS pool, generated by the 
nodPQ sulfate activation complexes, can be engaged for sulfa- 
tion of amino acids. Figure 7 shows a model of how the various 
forms of activated sulfate in Sinorhizobium sp. strain BR816 
may be partitioned into the pathways for amino acid biosyn- 
thesis and sulfation of Nod factors and other compounds that 
might be important during symbiosis. The cysDN-dependent 
APS pool supplies activated sulfate that is subsequently re- 
duced to form sulfite by the CysH APS reductase. Sulfite is 
further reduced to sulfide, which is then incorporated into the 
cysteine and methionine biosynthesis pathway. Our data sug- 
gest that the symbiotic APS and/or PAPS pool, created by the 
nodPQ-dependent sulfate activation step, can also be used by 
CysH (in a less efficient manner) for the biosynthesis of sulfur- 
containing amino acids, when needed. Moreover, both house- 
hold and symbiotic APS pools can be mutually exchanged. In S. 
meliloti, the nodPQ- and cysDN-encoded sulfate activation systems 
cannot substitute for each other (46, 47). 

Why would Sinorhizobium sp. strain BR816 possess three 
functional sulfate activation systems for Nod factor sulfation? 
Besides the use of activated sulfate for the biosynthesis of 
sulfur-containing amino acids and sulfation of Nod factors, 
(P)APS is needed for Nod factor methylation (37). Introduc- 
tion of the S. meliloti nodPQ genes into R. tropici resulted in a 
decreased rate of R. tropici Nod factor methylation, while all R. 
tropici Nod factor backbones were sulfated. Waelkens et al. 
(57) showed that methylation of Nod factors is required for 
nodulation of bean. In Sinorhizobium sp. strain BR816, the 
three operational sulfate-activating systems could play an important 
role in maintaining substitutions of bacterial determinants 
for symbiosis. 

An R. tropici nodPQ mutant (producing drastically reduced 
amounts of sulfated Nod factors) and an R. tropici nodH mu- 
tant (producing nonsulfated Nod factors) still activate the sig- 
naling cascade for emergence of effective nodules on P. vulgaris 
roots (12, 24). For bean plants, the sulfate moiety of the Nod 
factor was shown to be involved in the efficiency of nodule 
formation but appears not to be essential (11, 22). The effects 
of the cysD mutant FAJ1600 and the nodP1 nodQ2 cysD triple 
mutant FAJ1604 on bean symbiosis were seen mainly in the 
reduction of nodule number per plant. Since under free-living 
conditions, cysDN-dependent biosynthesis of sulfur-containing 
amino acids is essential to allow optimal growth of Sinorhizobium 
sp. strain BR816 with sulfate as the sole sulfur 
source, bacterial growth defects are likely the main reason for 
the decreased nitrogen fixation of bean plants inoculated with 
the mutants FAJ1600 and FAJ1604. These defects can be re- 
stored by the addition of methionine to the plant nutrient 
solution. We propose that at the early stages of nodulation, 
the plant root exudates of the germinated seedlings provide 
enough sources of organic sulfur to allow bacterial growth. 
However, a shortage of an organic sulfur source like methio- 
nine impairs bacterial growth inside the plant. Inoculation 
experiments with a Rhizobium etli metZ (O-succinylhomoserine 
sulfhydrylase for methionine biosynthesis) (51) mutant on 
bean plants resulted in the formation of ineffective (Nod+ 
Fix-) nodules, which suggested that root cells do not supply 
the inoculant bacteria with enough methionine. The fact that 
supplemented methionine resulted in an overall higher nitro- 
gen fixation capacity of P. vulgaris BAT477 inoculated with 
BR816 strains supports this hypothesis. In contrast to our ob- 
servations, an R. etli cysG (siroheme synthetase for cysteine 
biosynthesis) mutant, which is able to induce the formation of 
effective nodules (Nod+ Fix+) on the roots of common bean, 
seems to have access to an organic sulfur source like cysteine or 
glutathione to allow growth inside the plant (52). 

How can the strictly separated symbiotic and endogenous 
(P)APS pools in S. meliloti versus the complementary (P)APS 
pools in Sinorhizobium sp. strain BR816 be explained? Pre- 
sumably, the nodPQ genes arose in ancestral rhizobial strains 
through duplications of the endogenous cysDNC genes. Later, 
these nodPQ genes evolved toward more specialized symbiotic 
genes, whereas the endogenous cysC gene, encoding the APS 
kinase, was apparently lost during evolution. At this stage, 
complementation between both PAPS pools was still possible 
(the case of Sinorhizobium sp. strain BR816). Then, the genetic 
separation of the two sulfate-activating systems could have 
further evolved into two more efficient and energy-saving 
separate multienzyme complexes (the case of S. meliloti). 